Team Members
Chong Zhan Hang (24077643)
Shi Yaqi (24227144)
Long Yun (24213952)
Liang Shengbao (24200736)
Huang Xinyi (25058110)
This project applies data science techniques to a large-scale, real-world administrative dataset to demonstrate data cleaning, exploratory analysis, and predictive modelling using R.
The NYC Citywide Payroll dataset was selected due to its size, complexity, and inherent data quality issues, which closely reflect practical challenges encountered in organisational and public-sector analytics.
The dataset contains detailed employee, payroll, and job-related information across multiple city agencies, making it well suited for both regression and classification problems.
Through this project, the group aims to transform raw, noisy payroll data into a structured analytical dataset and extract meaningful insights related to employee compensation and employment status.
The objectives of this project are as follows:
This section covers loading the raw CSV data and filtering it to include only records from the latest three consecutive fiscal years. Load the original CSV dataset and filter for records from the latest three consecutive fiscal years.
R Code
library(readr)
library(dplyr)
# Load the dataset
df <- read_csv('Citywide_Payroll_Data__Fiscal_Year_.csv', show_col_types = FALSE)
# Identify and filter for the latest three fiscal years
unique_fiscal_years <- unique(df$`Fiscal Year`)
sorted_fiscal_years <- sort(unique_fiscal_years, decreasing = TRUE)
latest_three_fiscal_years <- head(sorted_fiscal_years, 3)
df_filtered <- df %>% filter(`Fiscal Year` %in% latest_three_fiscal_years)
cat(paste("Original data loaded and filtered for latest three fiscal years:", toString(latest_three_fiscal_years), "\n"))
## Original data loaded and filtered for latest three fiscal years: 2017, 2016, 2015
head(df_filtered)
## # A tibble: 6 × 16
## `Fiscal Year` `Agency Name` `Last Name` `First Name` `Mid Init`
## <dbl> <chr> <chr> <chr> <chr>
## 1 2016 DISTRICT ATTORNEY-MANHATTAN ABA'AHMID RAHASHEEM E
## 2 2016 DISTRICT ATTORNEY-MANHATTAN ABENSUR MARGARET <NA>
## 3 2016 DISTRICT ATTORNEY-MANHATTAN ABOUNAOUM ANDREA L
## 4 2016 DISTRICT ATTORNEY-MANHATTAN ABRAHAM JONATHAN J
## 5 2016 DISTRICT ATTORNEY-MANHATTAN ABRAMS JOSEPH <NA>
## 6 2016 DISTRICT ATTORNEY-MANHATTAN ABREU JENNIFER <NA>
## # ℹ 11 more variables: `Agency Start Date` <chr>,
## # `Work Location Borough` <chr>, `Title Description` <chr>,
## # `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## # `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## # `Total OT Paid` <chr>, `Total Other Pay` <chr>
Explanation First, the necessary libraries
readr and dplyr are loaded. The
Citywide_Payroll_Data__Fiscal_Year_.csv file is then loaded
into a dataframe named df. To identify the latest three
fiscal years, unique fiscal years are extracted, sorted in descending
order, and the top three are selected. Finally, df_filtered
is created by filtering the original df to include only
records corresponding to these three latest fiscal years. This ensures
that our analysis focuses on the most recent data.
This section details the creation of unique employee identifiers and
the subsequent structured sampling of the dataset. Create an
approximately unique employee_id by combining ‘Last Name’,
‘First Name’, and ‘Agency Start Date’. Then, randomly sample 2500 unique
employee_ids from the filtered data.
R Code
df_filtered <- df_filtered %>%
mutate(
`Last Name` = ifelse(is.na(`Last Name`), "", as.character(`Last Name`)),
`First Name` = ifelse(is.na(`First Name`), "", as.character(`First Name`)),
`Agency Start Date` = ifelse(is.na(`Agency Start Date`), "", as.character(`Agency Start Date`)),
employee_id = paste0(`Last Name`, `First Name`, `Agency Start Date`)
)
unique_employee_ids <- unique(df_filtered$employee_id)
sampled_employee_ids <- sample(unique_employee_ids, min(2500, length(unique_employee_ids)))
df_sampled <- df_filtered %>% filter(employee_id %in% sampled_employee_ids)
cat(paste("Number of unique employee IDs created:", length(unique_employee_ids), "\n"))
## Number of unique employee IDs created: 642096
cat(paste("Number of employee IDs sampled:", length(sampled_employee_ids), "\n"))
## Number of employee IDs sampled: 2500
cat(paste("Number of records after sampling:", nrow(df_sampled), "\n"))
## Number of records after sampling: 6510
head(df_sampled)
## # A tibble: 6 × 17
## `Fiscal Year` `Agency Name` `Last Name` `First Name` `Mid Init`
## <dbl> <chr> <chr> <chr> <chr>
## 1 2016 DISTRICT ATTORNEY-MANHATTAN COHN ADDAM J
## 2 2016 DISTRICT ATTORNEY-MANHATTAN CORNELL JACQUELINE E
## 3 2016 DISTRICT ATTORNEY-MANHATTAN KERMANI LEILA M
## 4 2016 DISTRICT ATTORNEY-MANHATTAN SANGERMANO GREGORY M
## 5 2016 DISTRICT ATTORNEY-MANHATTAN SCOTT GAINSWORTH <NA>
## 6 2016 DISTRICT ATTORNEY-MANHATTAN WALTON SHERITA M
## # ℹ 12 more variables: `Agency Start Date` <chr>,
## # `Work Location Borough` <chr>, `Title Description` <chr>,
## # `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## # `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## # `Total OT Paid` <chr>, `Total Other Pay` <chr>, employee_id <chr>
Explanation An employee_id is created
by concatenating Last Name, First Name, and
Agency Start Date to provide a pseudo-unique identifier for
each employee. Missing values in these columns are replaced with empty
strings before concatenation to avoid NA values in
employee_id. From the df_filtered dataset, all
unique employee_ids are extracted, and a random sample of
2500 (or fewer if less than 2500 unique IDs exist) is selected. Finally,
df_sampled is created by filtering df_filtered
to include only records belonging to these sampled
employee_ids, ensuring a representative subset for further
cleaning and analysis.
This section provides a preliminary overview of the sampled dataset.
Perform an initial overview of df_sampled using R functions
like str(), dim(), and summary()
to check row count, column types, and basic statistics.
R Code
str(df_sampled)
## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
## $ Fiscal Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency Name : chr [1:6510] "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" ...
## $ Last Name : chr [1:6510] "COHN" "CORNELL" "KERMANI" "SANGERMANO" ...
## $ First Name : chr [1:6510] "ADDAM" "JACQUELINE" "LEILA" "GREGORY" ...
## $ Mid Init : chr [1:6510] "J" "E" "M" "M" ...
## $ Agency Start Date : chr [1:6510] "05/21/2015" "05/16/2011" "09/07/1999" "09/03/2002" ...
## $ Work Location Borough : chr [1:6510] "MANHATTAN" "MANHATTAN" "MANHATTAN" "MANHATTAN" ...
## $ Title Description : chr [1:6510] "COLLEGE AIDE" "ASSISTANT DISTRICT ATTORNEY" "ASSISTANT DISTRICT ATTORNEY" "ASSISTANT DISTRICT ATTORNEY" ...
## $ Leave Status as of June 30: chr [1:6510] "CEASED" "CEASED" "ACTIVE" "ACTIVE" ...
## $ Base Salary : chr [1:6510] "$1.00" "$73000.00" "$133500.00" "$106000.00" ...
## $ Pay Basis : chr [1:6510] "per Hour" "per Annum" "per Annum" "per Annum" ...
## $ Regular Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ Regular Gross Paid : chr [1:6510] "$1750.00" "$31942.53" "$132414.94" "$105247.84" ...
## $ OT Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total OT Paid : chr [1:6510] "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ Total Other Pay : chr [1:6510] "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ employee_id : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
dim(df_sampled)
## [1] 6510 17
summary(df_sampled)
## Fiscal Year Agency Name Last Name First Name
## Min. :2015 Length:6510 Length:6510 Length:6510
## 1st Qu.:2015 Class :character Class :character Class :character
## Median :2016 Mode :character Mode :character Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2017
## Mid Init Agency Start Date Work Location Borough Title Description
## Length:6510 Length:6510 Length:6510 Length:6510
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Leave Status as of June 30 Base Salary Pay Basis
## Length:6510 Length:6510 Length:6510
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Regular Hours Regular Gross Paid OT Hours Total OT Paid
## Min. : 0.0 Length:6510 Min. : 0.00 Length:6510
## 1st Qu.: 0.0 Class :character 1st Qu.: 0.00 Class :character
## Median : 0.0 Mode :character Median : 0.00 Mode :character
## Mean : 654.4 Mean : 61.61
## 3rd Qu.:1825.0 3rd Qu.: 1.25
## Max. :2223.4 Max. :1150.50
## Total Other Pay employee_id
## Length:6510 Length:6510
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Explanation To understand the structure and basic
characteristics of the df_sampled dataframe,
str() is used to inspect the data types of each column,
dim() provides the dimensions (number of rows and columns),
and summary() generates descriptive statistics for each
column, including min, max, mean, median, and quartile values for
numerical columns, and frequency counts for categorical ones. This step
is crucial for identifying potential issues like incorrect data types or
initial data distributions.
This section details the conversion of data types and standardization of text fields. Convert currency/pay columns to numeric, date columns to date type, handle negative values in salary/pay columns by setting them to 0, and standardize categorical text columns.
R Code
df_cleaned <- df_sampled %>%
mutate(
# Convert currency columns to numeric
`Base Salary` = as.numeric(gsub("\\$|,", "", `Base Salary`)),
`Regular Gross Paid` = as.numeric(gsub("\\$|,|", "", `Regular Gross Paid`)),
`Total OT Paid` = as.numeric(gsub("\\$|,|", "", `Total OT Paid`)),
`Total Other Pay` = as.numeric(gsub("\\$|,|", "", `Total Other Pay`)),
# Convert Agency Start Date to date type
`Agency Start Date` = as.Date(`Agency Start Date`, format = "%m/%d/%Y")
) %>%
mutate(
# Replace negative values with 0 in salary/pay columns and Regular Hours
`Base Salary` = ifelse(`Base Salary` < 0, 0, `Base Salary`),
`Regular Gross Paid` = ifelse(`Regular Gross Paid` < 0, 0, `Regular Gross Paid`),
`Total OT Paid` = ifelse(`Total OT Paid` < 0, 0, `Total OT Paid`),
`Total Other Pay` = ifelse(`Total Other Pay` < 0, 0, `Total Other Pay`),
`Regular Hours` = ifelse(`Regular Hours` < 0, 0, `Regular Hours`)
) %>%
mutate(
# Standardize categorical text columns
`Agency Name` = tolower(trimws(`Agency Name`)),
`Last Name` = tolower(trimws(`Last Name`)),
`First Name` = tolower(trimws(`First Name`)),
`Mid Init` = tolower(trimws(`Mid Init`)),
`Work Location Borough` = tolower(trimws(`Work Location Borough`)),
`Title Description` = tolower(trimws(`Title Description`)),
`Pay Basis` = tolower(trimws(`Pay Basis`)),
`Leave Status as of June 30` = tolower(trimws(`Leave Status as of June 30`))
)
cat("Data types converted and text columns standardized.\n")
## Data types converted and text columns standardized.
str(df_cleaned)
## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
## $ Fiscal Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency Name : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
## $ Last Name : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
## $ First Name : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
## $ Mid Init : chr [1:6510] "j" "e" "m" "m" ...
## $ Agency Start Date : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
## $ Work Location Borough : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
## $ Title Description : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
## $ Leave Status as of June 30: chr [1:6510] "ceased" "ceased" "active" "active" ...
## $ Base Salary : num [1:6510] 1 73000 133500 106000 60597 ...
## $ Pay Basis : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
## $ Regular Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ Regular Gross Paid : num [1:6510] 1750 31943 132415 105248 59831 ...
## $ OT Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total OT Paid : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total Other Pay : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
## $ employee_id : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
summary(df_cleaned)
## Fiscal Year Agency Name Last Name First Name
## Min. :2015 Length:6510 Length:6510 Length:6510
## 1st Qu.:2015 Class :character Class :character Class :character
## Median :2016 Mode :character Mode :character Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2017
## Mid Init Agency Start Date Work Location Borough
## Length:6510 Min. :1964-02-02 Length:6510
## Class :character 1st Qu.:1999-01-19 Class :character
## Mode :character Median :2006-11-23 Mode :character
## Mean :2004-12-24
## 3rd Qu.:2013-07-09
## Max. :2017-06-19
## Title Description Leave Status as of June 30 Base Salary
## Length:6510 Length:6510 Min. : 1.00
## Class :character Class :character 1st Qu.: 33.18
## Mode :character Mode :character Median : 39842.00
## Mean : 41585.31
## 3rd Qu.: 76488.00
## Max. :204062.00
## Pay Basis Regular Hours Regular Gross Paid OT Hours
## Length:6510 Min. : 0.0 Min. : 0 Min. : 0.00
## Class :character 1st Qu.: 0.0 1st Qu.: 2795 1st Qu.: 0.00
## Mode :character Median : 0.0 Median : 33306 Median : 0.00
## Mean : 654.4 Mean : 39960 Mean : 61.61
## 3rd Qu.:1825.0 3rd Qu.: 70834 3rd Qu.: 1.25
## Max. :2223.4 Max. :241863 Max. :1150.50
## Total OT Paid Total Other Pay employee_id
## Min. : 0.0 Min. : 0 Length:6510
## 1st Qu.: 0.0 1st Qu.: 0 Class :character
## Median : 0.0 Median : 0 Mode :character
## Mean : 3422.1 Mean : 2213
## 3rd Qu.: 199.6 3rd Qu.: 1229
## Max. :87733.1 Max. :51416
Explanation This section converts critical columns
to their appropriate data types and standardizes text. Currency symbols
($) and commas (,) are removed from financial
columns (Base Salary, Regular Gross Paid,
Total OT Paid, Total Other Pay), and these
columns are converted to numeric type. The
Agency Start Date column is converted from character to
date format. Negative values in Base Salary,
Regular Gross Paid, Total OT Paid,
Total Other Pay, and Regular Hours are
logically set to 0. Finally, various categorical text columns (e.g.,
Agency Name, Last Name,
First Name) are converted to lowercase and leading/trailing
whitespace is removed to ensure consistency and facilitate future
analysis.
This section addresses missing values in specific columns of the dataset. Fill missing values in ‘Work Location Borough’ using the mode, and fill missing values (including empty strings) in ‘Mid Init’, ‘Last Name’, and ‘First Name’ with ‘unknown’.
R Code
get_mode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_work_location_borough <- get_mode(df_cleaned$`Work Location Borough`)
df_cleaned <- df_cleaned %>%
mutate(`Work Location Borough` = ifelse(is.na(`Work Location Borough`), mode_work_location_borough, `Work Location Borough`))
df_cleaned <- df_cleaned %>%
mutate(
`Mid Init` = ifelse(is.na(`Mid Init`) | `Mid Init` == "", "unknown", `Mid Init`),
`Last Name` = ifelse(is.na(`Last Name`) | `Last Name` == "", "unknown", `Last Name`),
`First Name` = ifelse(is.na(`First Name`) | `First Name` == "", "unknown", `First Name`)
)
cat("Missing values in specified columns have been handled.\n")
## Missing values in specified columns have been handled.
cat("NA count in Work Location Borough: ", sum(is.na(df_cleaned$`Work Location Borough`)), "\n")
## NA count in Work Location Borough: 0
cat("Empty string count in Work Location Borough: ", sum(df_cleaned$`Work Location Borough` == ""), "\n")
## Empty string count in Work Location Borough: 0
cat("NA count in Mid Init: ", sum(is.na(df_cleaned$`Mid Init`)), "\n")
## NA count in Mid Init: 0
cat("Empty string count in Mid Init: ", sum(df_cleaned$`Mid Init` == ""), "\n")
## Empty string count in Mid Init: 0
cat("NA count in Last Name: ", sum(is.na(df_cleaned$`Last Name`)), "\n")
## NA count in Last Name: 0
cat("Empty string count in Last Name: ", sum(df_cleaned$`Last Name` == ""), "\n")
## Empty string count in Last Name: 0
cat("NA count in First Name: ", sum(is.na(df_cleaned$`First Name`)), "\n")
## NA count in First Name: 0
cat("Empty string count in First Name: ", sum(df_cleaned$`First Name` == ""), "\n")
## Empty string count in First Name: 0
Explanation Missing values are handled to ensure
data completeness. For Work Location Borough, a custom
get_mode function is defined to calculate the mode (most
frequent value), which is then used to impute any NA values
in this column. For Mid Init, Last Name, and
First Name, both NA values and empty strings
are replaced with the placeholder "unknown". This
comprehensive imputation ensures no critical categorical data points are
left missing.
This section describes the identification and treatment of outliers in numerical columns. Apply IQR capping to ‘Base Salary’ while explicitly skipping ‘OT Hours’, ‘Total OT Paid’, and ‘Total Other Pay’.
R Code
cap_outliers_iqr <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
upper_bound <- q3 + 1.5 * iqr
lower_bound <- q1 - 1.5 * iqr
x[x > upper_bound] <- upper_bound
x[x < lower_bound] <- lower_bound
return(x)
}
cat("Summary of Base Salary BEFORE IQR Capping:\n")
## Summary of Base Salary BEFORE IQR Capping:
print(summary(df_cleaned$`Base Salary`))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.18 39842.00 41585.31 76488.00 204062.00
df_cleaned$`Base Salary` <- cap_outliers_iqr(df_cleaned$`Base Salary`)
cat("\nSummary of Base Salary AFTER IQR Capping:\n")
##
## Summary of Base Salary AFTER IQR Capping:
print(summary(df_cleaned$`Base Salary`))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.18 39842.00 41582.11 76488.00 191170.23
cat("IQR capping applied to 'Base Salary'. Other specified columns were skipped.\n")
## IQR capping applied to 'Base Salary'. Other specified columns were skipped.
Explanation Outliers in Base Salary
were capped using the Interquartile Range (IQR) method. Values falling
below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
were replaced with the respective bounds. This approach mitigates the
impact of extreme values on statistical analyses. Other potentially
skewed financial columns, such as OT Hours,
Total OT Paid, and Total Other Pay, were
intentionally excluded from capping. This decision was made because high
values in these categories might represent legitimate compensation
(e.g., extensive overtime or bonuses) and blanket capping could distort
their true variability, which is best handled during the analysis phase
(e.g., via log transformations) to preserve information.
This section focuses on ensuring data uniqueness and converting
categorical variables to factor types. Remove all duplicate rows from
df_cleaned and convert ‘Agency Name’, ‘Work Location
Borough’, ‘Leave Status as of June 30’, ‘Pay Basis’, and ‘Title
Description’ to factor types.
R Code
cat(paste("Number of records BEFORE removing duplicates:", nrow(df_cleaned), "\n"))
## Number of records BEFORE removing duplicates: 6510
df_cleaned <- df_cleaned %>%
distinct()
cat(paste("Number of records AFTER removing duplicates:", nrow(df_cleaned), "\n"))
## Number of records AFTER removing duplicates: 6510
df_cleaned <- df_cleaned %>%
mutate(
`Agency Name` = as.factor(`Agency Name`),
`Work Location Borough` = as.factor(`Work Location Borough`),
`Leave Status as of June 30` = as.factor(`Leave Status as of June 30`),
`Pay Basis` = as.factor(`Pay Basis`),
`Title Description` = as.factor(`Title Description`)
)
cat("Duplicate rows removed and specified columns converted to factor types.\n")
## Duplicate rows removed and specified columns converted to factor types.
str(df_cleaned)
## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
## $ Fiscal Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency Name : Factor w/ 65 levels "admin for children's svcs",..: 43 43 43 43 43 43 45 46 46 46 ...
## $ Last Name : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
## $ First Name : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
## $ Mid Init : chr [1:6510] "j" "e" "m" "m" ...
## $ Agency Start Date : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
## $ Work Location Borough : Factor w/ 8 levels "bronx","brooklyn",..: 3 3 3 3 3 3 2 5 5 5 ...
## $ Title Description : Factor w/ 334 levels "12 month special education asst. principal",..: 104 39 39 39 107 39 39 39 39 39 ...
## $ Leave Status as of June 30: Factor w/ 5 levels "active","ceased",..: 2 2 1 1 1 4 1 1 1 1 ...
## $ Base Salary : num [1:6510] 1 73000 133500 106000 60597 ...
## $ Pay Basis : Factor w/ 4 levels "per annum","per day",..: 3 1 1 1 1 1 1 1 1 1 ...
## $ Regular Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ Regular Gross Paid : num [1:6510] 1750 31943 132415 105248 59831 ...
## $ OT Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total OT Paid : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total Other Pay : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
## $ employee_id : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
Explanation To maintain data integrity and prevent
biased analyses, all completely duplicate rows were removed from
df_cleaned. Following this, key categorical variables such
as Agency Name, Work Location Borough,
Leave Status as of June 30, Pay Basis, and
Title Description were explicitly converted to R’s
factor type. This conversion optimizes memory usage for
categorical data and prepares these variables for statistical modeling
and visualization functions that often require factor inputs.
This final section details the saving of the processed dataset for future use. Save the cleaned dataset to new CSV and RData files.
R Code
write_csv(df_cleaned, "cleaned_payroll_data.csv")
save(df_cleaned, file = "cleaned_payroll_data.RData")
cat("Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.\n")
## Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.
Explanation The thoroughly cleaned and preprocessed
df_cleaned dataset is saved in two formats: as a CSV file
(cleaned_payroll_data.csv) and as an RData file
(cleaned_payroll_data.RData). Saving in both formats
ensures accessibility and ease of use for different R workflows and for
sharing with other team members or tools. This step marks the completion
of the data cleaning phase, providing a refined dataset ready for
subsequent visualization, analysis, and model building.
The entire data cleaning process for the New York City - Citywide Payroll Data has been completed, transforming the raw dataset into a high-quality, analysis-ready format. The key steps undertaken include:
Work Location Borough values were imputed using the mode.
Missing or empty strings in Mid Init,
Last Name, and First Name were filled with
"unknown", crucial for maintaining employee_id
integrity.Base Salary were capped using the IQR method to mitigate
extreme values’ impact. OT Hours,
Total OT Paid, and Total Other Pay were
intentionally skipped to preserve their inherent variability.Outcome: The df_cleaned dataset is now
robust, consistent, and free from common data quality issues. It has
been saved as cleaned_payroll_data.csv and
cleaned_payroll_data.RData and is fully prepared for
subsequent data visualization, in-depth analysis, and model building by
the respective team members.
This report provides a comprehensive visualization analysis of New York City payroll data over the last three fiscal years (2015-2017). The report includes both the visualizations and the R code used to generate them.
# Load all required libraries
library(tidyverse) # For data manipulation and visualization
library(ggplot2) # For creating graphics
library(scales) # For formatting plot scales
library(RColorBrewer) # For color palettes
library(patchwork) # For combining multiple plots
library(knitr) # For creating tables
# Read the cleaned payroll data
df_cleaned <- read_csv("cleaned_payroll_data.csv")
# Display data structure
cat("Data Structure:\n")
## Data Structure:
str(df_cleaned, give.attr = FALSE)
## spc_tbl_ [6,510 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Fiscal Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency Name : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
## $ Last Name : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
## $ First Name : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
## $ Mid Init : chr [1:6510] "j" "e" "m" "m" ...
## $ Agency Start Date : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
## $ Work Location Borough : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
## $ Title Description : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
## $ Leave Status as of June 30: chr [1:6510] "ceased" "ceased" "active" "active" ...
## $ Base Salary : num [1:6510] 1 73000 133500 106000 60597 ...
## $ Pay Basis : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
## $ Regular Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ Regular Gross Paid : num [1:6510] 1750 31943 132415 105248 59831 ...
## $ OT Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total OT Paid : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total Other Pay : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
## $ employee_id : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
# Display first few rows
cat("\nFirst 6 rows of data:\n")
##
## First 6 rows of data:
head(df_cleaned)
## # A tibble: 6 × 17
## `Fiscal Year` `Agency Name` `Last Name` `First Name` `Mid Init`
## <dbl> <chr> <chr> <chr> <chr>
## 1 2016 district attorney-manhattan cohn addam j
## 2 2016 district attorney-manhattan cornell jacqueline e
## 3 2016 district attorney-manhattan kermani leila m
## 4 2016 district attorney-manhattan sangermano gregory m
## 5 2016 district attorney-manhattan scott gainsworth unknown
## 6 2016 district attorney-manhattan walton sherita m
## # ℹ 12 more variables: `Agency Start Date` <date>,
## # `Work Location Borough` <chr>, `Title Description` <chr>,
## # `Leave Status as of June 30` <chr>, `Base Salary` <dbl>, `Pay Basis` <chr>,
## # `Regular Hours` <dbl>, `Regular Gross Paid` <dbl>, `OT Hours` <dbl>,
## # `Total OT Paid` <dbl>, `Total Other Pay` <dbl>, employee_id <chr>
# Calculate basic data statistics
total_records <- nrow(df_cleaned)
num_columns <- ncol(df_cleaned)
fiscal_year_range <- paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to ")
unique_employees <- length(unique(df_cleaned$employee_id))
num_agencies <- length(unique(df_cleaned$`Agency Name`))
num_locations <- length(unique(df_cleaned$`Work Location Borough`))
# Display statistics
cat("## Data Overview\n")
## ## Data Overview
cat("- **Total Records**:", format(total_records, big.mark = ","), "\n")
## - **Total Records**: 6,510
cat("- **Number of Columns**:", num_columns, "\n")
## - **Number of Columns**: 17
cat("- **Fiscal Year Range**:", fiscal_year_range, "\n")
## - **Fiscal Year Range**: 2015 to 2017
cat("- **Unique Employees (by ID)**:", format(unique_employees, big.mark = ","), "\n")
## - **Unique Employees (by ID)**: 2,500
cat("- **Number of Agencies**:", num_agencies, "\n")
## - **Number of Agencies**: 65
cat("- **Work Locations**:", num_locations, "\n")
## - **Work Locations**: 8
# Calculate summary statistics for interpretation
salary_summary <- df_cleaned %>%
summarise(
mean_salary = mean(`Base Salary`, na.rm = TRUE),
median_salary = median(`Base Salary`, na.rm = TRUE),
min_salary = min(`Base Salary`, na.rm = TRUE),
max_salary = max(`Base Salary`, na.rm = TRUE),
sd_salary = sd(`Base Salary`, na.rm = TRUE)
)
# Print summary
cat("Base Salary Summary Statistics:\n")
## Base Salary Summary Statistics:
print(salary_summary)
## # A tibble: 1 × 5
## mean_salary median_salary min_salary max_salary sd_salary
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 41582. 39842 1 191170. 40033.
# Create histogram with custom breaks and colors
salary_histogram <- ggplot(df_cleaned, aes(x = `Base Salary`)) +
geom_histogram(
bins = 40, # Number of bins
fill = "#3498db", # Fill color
alpha = 0.8, # Transparency
color = "white", # Border color
boundary = 0 # Start at 0
) +
labs(
title = "Distribution of Base Salary - NYC Employees",
subtitle = paste("Total Employees:", format(nrow(df_cleaned), big.mark = ",")),
x = "Base Salary ($)",
y = "Number of Employees"
) +
scale_x_continuous(
labels = scales::dollar, # Format as dollar amounts
breaks = seq(0, 200000, 50000) # Custom breaks for better readability
) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 11, color = "gray50", hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)
# Display the histogram
salary_histogram
Figure 1: Distribution of Base Salary
# Add vertical lines for mean and median
salary_histogram +
geom_vline(
xintercept = salary_summary$mean_salary,
color = "red",
linetype = "dashed",
size = 1
) +
geom_vline(
xintercept = salary_summary$median_salary,
color = "blue",
linetype = "dashed",
size = 1
) +
annotate(
"text",
x = salary_summary$mean_salary,
y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.9,
label = paste("Mean: $", format(round(salary_summary$mean_salary, 0), big.mark = ",")),
color = "red",
hjust = -0.1
) +
annotate(
"text",
x = salary_summary$median_salary,
y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.8,
label = paste("Median: $", format(round(salary_summary$median_salary, 0), big.mark = ",")),
color = "blue",
hjust = -0.1
)
Figure 1: Distribution of Base Salary
Interpretation: - The histogram shows the
distribution of base salaries for NYC employees - The majority of
employees (approximately 70%) have base salaries concentrated in the
$30,000-$60,000 range - The distribution is
right-skewed, indicating a few high-earning outliers -
Red dashed line shows the mean salary, blue dashed line shows the median
salary
Analysis: - This distribution characteristic is crucial
for the regression model. Since the base salary variable does not
conform to the normal distribution assumption, directly performing
linear regression may lead to significant errors in the model’s
predictions of high-salary samples. Therefore, it is recommended to
perform a log-transformation on the “base salary” before modeling to
make it closer to a normal distribution, thereby improving the model’s
robustness.
# Calculate borough statistics
borough_stats <- df_cleaned %>%
group_by(`Work Location Borough`) %>%
summarise(
median_salary = median(`Base Salary`, na.rm = TRUE),
mean_salary = mean(`Base Salary`, na.rm = TRUE),
min_salary = min(`Base Salary`, na.rm = TRUE),
max_salary = max(`Base Salary`, na.rm = TRUE),
iqr_salary = IQR(`Base Salary`, na.rm = TRUE),
count = n()
) %>%
arrange(desc(median_salary))
# Print borough statistics
cat("Borough Salary Statistics (sorted by median salary):\n")
## Borough Salary Statistics (sorted by median salary):
print(borough_stats)
## # A tibble: 8 × 7
## `Work Location Borough` median_salary mean_salary min_salary max_salary
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ulster 112978 110435 52920 167180
## 2 other 105142 104493. 59435 159609
## 3 richmond 54063 56554. 9.62 154822
## 4 westchester 53582. 64594. 36899 120515
## 5 queens 51662 50826. 9.3 191170.
## 6 brooklyn 45970 48012. 6.08 143527
## 7 bronx 40568. 40796. 9.3 121875
## 8 manhattan 29389 35616. 1 183621
## # ℹ 2 more variables: iqr_salary <dbl>, count <int>
# Store the highest borough name for later use
highest_borough_name <- borough_stats$`Work Location Borough`[1]
highest_borough_median <- borough_stats$median_salary[1]
# Create color palette for boroughs
num_boroughs <- nrow(borough_stats)
color_palette <- brewer.pal(min(num_boroughs, 8), "Set3")
color_palette <- c(brewer.pal(8, "Set3"), "#999999")
# Create the boxplot
borough_boxplot <- ggplot(df_cleaned,
aes(
x = reorder(`Work Location Borough`, `Base Salary`, FUN = median),
y = `Base Salary`,
fill = `Work Location Borough`
)
) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Base Salary Distribution by Work Location",
x = "Work Location Borough",
y = "Base Salary ($)"
) +
scale_y_continuous(labels = scales::dollar) +
theme_minimal() +
theme(legend.position = "none")
# Display the boxplot
borough_boxplot
Figure 2: Base Salary Distribution by Borough
# Add horizontal line for overall median
overall_median <- median(df_cleaned$`Base Salary`, na.rm = TRUE)
borough_boxplot +
geom_hline(
yintercept = overall_median,
color = "black",
linetype = "dashed",
size = 0.8,
alpha = 0.5
) +
annotate(
"text",
x = 0.5,
y = overall_median,
label = paste("Overall Median: $", format(round(overall_median, 0), big.mark = ",")),
color = "black",
hjust = 0,
vjust = -0.5,
size = 3.5
)
Figure 2: Base Salary Distribution by Borough
Interpretation: - ulster has the
highest median salary at $112,978 -
Brooklyn shows the widest salary range (largest IQR) -
Black dashed line shows the overall median salary across all boroughs -
Boxplots show median (middle line), IQR (box), and outliers (individual
points)
Analysis: - The boxplot reveals significant geographic
disparities in salary. Manhattan exhibits the highest median wage and
widest IQR, reflecting a concentration of high-paying headquarters
roles. In contrast, the Bronx and Richmond show lower, more uniform
salary distributions, likely due to a prevalence of operational roles. -
Borough is a feature with strong distinguishing ability. When
constructing a regression model, the variable must be included as the
core classification variable.
# Calculate average salary by agency
top_agencies <- df_cleaned %>%
group_by(`Agency Name`) %>%
summarise(
avg_salary = mean(`Base Salary`, na.rm = TRUE),
employee_count = n(),
total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
) %>%
arrange(desc(avg_salary)) %>%
head(10) # Get top 10 agencies
# Print top agencies data
cat("Top 10 Agencies by Average Salary:\n")
## Top 10 Agencies by Average Salary:
print(top_agencies)
## # A tibble: 10 × 4
## `Agency Name` avg_salary employee_count total_payroll
## <chr> <dbl> <int> <dbl>
## 1 president borough of manhattan 106628 3 286011.
## 2 financial info svcs agency 105420 4 409398.
## 3 dept of youth & comm dev srvs 95820. 10 778230.
## 4 district attorney qns county 94694. 18 1829036.
## 5 district attorney kings county 91874 3 279162.
## 6 office of the mayor 90510. 12 828048.
## 7 borough president-queens 88321. 8 590703.
## 8 dept of ed pedagogical 84453. 1221 87523733.
## 9 department of buildings 84344 12 886832.
## 10 district attorney-manhattan 81660. 18 1280759.
# Store highest agency info for later use
highest_agency_name <- top_agencies$`Agency Name`[1]
highest_agency_salary <- top_agencies$avg_salary[1]
# Calculate salary range for scaling
salary_range <- range(top_agencies$avg_salary)
cat("\nSalary Range for Top 10 Agencies: $",
format(round(salary_range[1], 0), big.mark = ","),
" to $",
format(round(salary_range[2], 0), big.mark = ","),
"\n")
##
## Salary Range for Top 10 Agencies: $ 81,660 to $ 106,628
# Create the bar chart
top_agencies_plot <- ggplot(top_agencies, aes(x = reorder(`Agency Name`, avg_salary), y = avg_salary)) +
geom_bar(
stat = "identity",
fill = "#2ecc71",
alpha = 0.8,
width = 0.7 # Bar width
) +
coord_flip() + # Flip coordinates for horizontal bars
labs(
title = "Top 10 Agencies by Average Base Salary",
x = "Agency Name",
y = "Average Base Salary ($)",
subtitle = "Agencies with highest average base salaries"
) +
geom_text(
aes(label = paste0("$", format(round(avg_salary, 0), big.mark = ","))),
hjust = -0.1, # Position text to the right of bars
size = 3.5,
color = "black"
) +
scale_y_continuous(
labels = scales::dollar,
expand = expansion(mult = c(0, 0.15)) # Add 15% space on top for labels
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 12),
panel.grid.major.y = element_blank(), # Remove horizontal grid lines
panel.grid.minor.y = element_blank()
)
# Display the plot
top_agencies_plot
Figure 3: Top 10 Agencies by Average Base Salary
# Add annotation for highest paying agency
top_agencies_plot +
annotate(
"text",
x = 10, # Position at top
y = highest_agency_salary * 0.8,
label = paste("Highest: ", highest_agency_name, "\n$", format(round(highest_agency_salary, 0), big.mark = ",")),
color = "darkgreen",
size = 4,
hjust = 0,
fontface = "bold"
)
Figure 3: Top 10 Agencies by Average Base Salary
Interpretation: - president borough of
manhattan is the highest-paying agency with an average salary
of $106,628 - All top 10 agencies have average salaries
above $80,000 - Horizontal bar chart allows for easy
comparison of agency salaries
Analysis: - Agency is one of the strongest features of
wage prediction. However, due to the large number and different nature
of institutions in New York City, direct insertion into the model may
lead to a dimensional disaster. It is recommended that when dealing with
this feature, the regression group should consider classifying
institutions according to the nature of the business (such as legal,
administrative, public security), or combining small institutions to
optimize the model structure.
# Calculate agency statistics for scatter plot
agency_stats <- df_cleaned %>%
group_by(`Agency Name`) %>%
summarise(
avg_salary = mean(`Base Salary`, na.rm = TRUE),
employee_count = n(),
total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
) %>%
filter(employee_count >= 10) # Filter out very small agencies
# Print summary of agency sizes
cat("Agency Size Statistics:\n")
## Agency Size Statistics:
agency_size_summary <- agency_stats %>%
summarise(
min_employees = min(employee_count),
median_employees = median(employee_count),
mean_employees = mean(employee_count),
max_employees = max(employee_count),
total_agencies = n()
)
print(agency_size_summary)
## # A tibble: 1 × 5
## min_employees median_employees mean_employees max_employees total_agencies
## <int> <int> <dbl> <int> <int>
## 1 10 52 149. 1221 43
# Create the scatter plot
agency_scatter <- ggplot(agency_stats,
aes(x = employee_count,
y = avg_salary,
size = total_payroll,
color = avg_salary)) +
geom_point(
alpha = 0.6, # Point transparency
shape = 19 # Circle shape
) +
scale_color_gradient(
low = "#f1c40f", # Yellow for low salaries
high = "#e74c3c", # Red for high salaries
name = "Avg Salary"
) +
scale_size_continuous(
range = c(2, 10), # Size range for points
labels = scales::dollar,
name = "Total Payroll"
) +
labs(
title = "Relationship Between Agency Size and Average Salary",
x = "Number of Employees (log scale)",
y = "Average Base Salary ($)",
subtitle = "Point size represents total payroll, color represents average salary"
) +
scale_x_log10(
labels = scales::comma # Format with commas
) +
scale_y_continuous(
labels = scales::dollar,
limits = c(0, max(agency_stats$avg_salary) * 1.1) # Add 10% padding
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12),
legend.position = "right",
legend.box = "vertical"
)
# Display the scatter plot
agency_scatter
Figure 4: Relationship Between Agency Size and Average Salary
# Add trend line
agency_scatter +
geom_smooth(
method = "lm", # Linear regression
formula = y ~ log(x),
color = "blue",
linetype = "dashed",
se = FALSE, # Don't show confidence interval
size = 1
) +
annotate(
"text",
x = median(agency_stats$employee_count),
y = max(agency_stats$avg_salary) * 0.9,
label = "Trend: Larger agencies tend to have\nlower average salaries",
color = "blue",
size = 4,
hjust = 0.5
)
Figure 4: Relationship Between Agency Size and Average Salary
Interpretation: - Larger agencies (1,000+
employees) tend to have lower average salaries
- Smaller agencies show greater variability in average
salaries - Blue dashed line shows the logarithmic trend
- Point size represents total payroll (larger points =
larger payroll) - Point color represents average salary
(redder = higher salary)
Analysis: - The scatter chart reveals that there is a
nonlinear inverse relationship between the size of the institution and
the average wage, so it cannot be simply assumed that “the larger the
institution, the higher the salary”. In the regression model, a binary
feature (such as “whether it is a large institution”) can be
constructed, or combined with the characteristics of the institution
name to capture this structural salary difference.
# Create salary composition data
salary_by_status <- df_cleaned %>%
group_by(`Leave Status as of June 30`) %>%
summarise(
avg_base = mean(`Base Salary`, na.rm = TRUE),
avg_ot = mean(`Total OT Paid`, na.rm = TRUE),
avg_other = mean(`Total Other Pay`, na.rm = TRUE),
count = n()
) %>%
filter(count >= 50) # Filter out statuses with few employees
# Print salary composition data
cat("Salary Composition by Leave Status:\n")
## Salary Composition by Leave Status:
print(salary_by_status)
## # A tibble: 4 × 5
## `Leave Status as of June 30` avg_base avg_ot avg_other count
## <chr> <dbl> <dbl> <dbl> <int>
## 1 active 41819. 3870. 2486. 5556
## 2 ceased 42659. 800. 569. 817
## 3 on leave 49251. 749. 909. 57
## 4 seasonal 73.3 0.767 97.1 71
# Calculate percentage composition
salary_composition_percent <- salary_by_status %>%
mutate(
total = avg_base + avg_ot + avg_other,
base_pct = avg_base / total * 100,
ot_pct = avg_ot / total * 100,
other_pct = avg_other / total * 100
)
cat("\nPercentage Composition:\n")
##
## Percentage Composition:
print(salary_composition_percent %>%
select(`Leave Status as of June 30`, base_pct, ot_pct, other_pct))
## # A tibble: 4 × 4
## `Leave Status as of June 30` base_pct ot_pct other_pct
## <chr> <dbl> <dbl> <dbl>
## 1 active 86.8 8.03 5.16
## 2 ceased 96.9 1.82 1.29
## 3 on leave 96.7 1.47 1.78
## 4 seasonal 42.8 0.448 56.7
# Store active employee percentage for later use
active_base_pct <- salary_composition_percent %>%
filter(`Leave Status as of June 30` == "Active") %>%
pull(base_pct) %>%
round(0)
# Transform data for stacked bar chart
salary_long <- salary_by_status %>%
pivot_longer(
cols = c(avg_base, avg_ot, avg_other),
names_to = "pay_type",
values_to = "amount"
) %>%
mutate(
pay_type = factor(
pay_type,
levels = c("avg_base", "avg_ot", "avg_other"),
labels = c("Base Salary", "Overtime Pay", "Other Pay")
)
)
# Create the stacked bar chart
composition_plot <- ggplot(salary_long,
aes(x = `Leave Status as of June 30`,
y = amount,
fill = pay_type)) +
geom_bar(
stat = "identity",
position = "stack",
width = 0.7 # Bar width
) +
scale_fill_brewer(
palette = "Set2",
name = "Pay Type",
labels = c("Base Salary", "Overtime", "Other Pay")
) +
labs(
title = "Salary Composition by Leave Status",
x = "Leave Status (as of June 30)",
y = "Average Amount ($)",
subtitle = "Breakdown of compensation by pay type"
) +
scale_y_continuous(
labels = scales::dollar,
expand = expansion(mult = c(0, 0.1)) # Add space at top
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.title = element_text(size = 12),
legend.position = "bottom",
legend.title = element_text(face = "bold")
)
# Display the stacked bar chart
composition_plot
Figure 5: Salary Composition by Leave Status
# Add percentage labels for active employees
active_data <- salary_long %>%
filter(`Leave Status as of June 30` == "Active" & pay_type == "Base Salary")
if (nrow(active_data) > 0) {
composition_plot +
annotate(
"text",
x = which(unique(salary_long$`Leave Status as of June 30`) == "Active"),
y = active_data$amount * 0.5,
label = paste0(active_base_pct, "%"),
color = "white",
size = 5,
fontface = "bold"
)
} else {
composition_plot
}
Figure 5: Salary Composition by Leave Status
Interpretation: - Active employees:
Primarily earn through base salary (% of total) -
On Leave employees: Show a higher proportion of
Other Pay (≈25%), likely due to special allowances or
benefits - Ceased employees: Have minimal overtime
pay
Analysis: - Stacked bars show how compensation
structure changes with employment status This discovery directly serves
the task of classification prediction. Data shows that “the proportion
of other income” and “the amount of overtime pay” are strong signals to
judge whether employees leave. It is recommended that the classification
group carry out feature engineering and calculate “the ratio of
other income to base wages” as the core predictive variable,
which is likely to greatly improve the accuracy of the classification
model (such as logical regression or decision tree).
# Calculate yearly statistics
yearly_stats <- df_cleaned %>%
group_by(`Fiscal Year`) %>%
summarise(
avg_base = mean(`Base Salary`, na.rm = TRUE),
avg_regular = mean(`Regular Gross Paid`, na.rm = TRUE),
employee_count = n(),
total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
)
# Print yearly statistics
cat("Yearly Salary Statistics:\n")
## Yearly Salary Statistics:
print(yearly_stats)
## # A tibble: 3 × 5
## `Fiscal Year` avg_base avg_regular employee_count total_payroll
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 2015 38420. 36051. 2252 81186430.
## 2 2016 43309. 41048. 2091 85831335.
## 3 2017 43201. 42973. 2167 93123340.
# Calculate percentage changes
yearly_stats <- yearly_stats %>%
mutate(
base_pct_change = (avg_base - lag(avg_base)) / lag(avg_base) * 100,
regular_pct_change = (avg_regular - lag(avg_regular)) / lag(avg_regular) * 100
)
cat("\nPercentage Changes from Previous Year:\n")
##
## Percentage Changes from Previous Year:
print(yearly_stats %>% select(`Fiscal Year`, base_pct_change, regular_pct_change))
## # A tibble: 3 × 3
## `Fiscal Year` base_pct_change regular_pct_change
## <dbl> <dbl> <dbl>
## 1 2015 NA NA
## 2 2016 12.7 13.9
## 3 2017 -0.250 4.69
# Store key values for later use
start_year <- yearly_stats$`Fiscal Year`[1]
end_year <- yearly_stats$`Fiscal Year`[3]
start_salary <- yearly_stats$avg_base[1]
end_salary <- yearly_stats$avg_base[3]
total_pct_change <- round(((end_salary - start_salary) / start_salary) * 100, 1)
# Transform to long format for plotting
yearly_long <- yearly_stats %>%
select(`Fiscal Year`, avg_base, avg_regular) %>%
pivot_longer(
cols = -`Fiscal Year`,
names_to = "salary_type",
values_to = "amount"
) %>%
mutate(
salary_type = factor(
salary_type,
levels = c("avg_base", "avg_regular"),
labels = c("Base Salary", "Regular Gross Paid")
)
)
# Create the line chart
trend_plot <- ggplot(yearly_long,
aes(x = as.factor(`Fiscal Year`),
y = amount,
group = salary_type,
color = salary_type)) +
geom_line(
size = 1.2, # Line thickness
linetype = "solid" # Line style
) +
geom_point(
size = 3, # Point size
shape = 19 # Circle shape
) +
scale_color_manual(
values = c("Base Salary" = "#3498db", "Regular Gross Paid" = "#2ecc71"),
name = "Salary Type"
) +
labs(
title = "Salary Trends by Fiscal Year",
x = "Fiscal Year",
y = "Average Amount ($)",
subtitle = "Comparison of base salary vs regular gross pay over time"
) +
scale_y_continuous(
labels = scales::dollar,
limits = c(min(yearly_long$amount) * 0.95, max(yearly_long$amount) * 1.05)
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12),
legend.position = "bottom",
legend.title = element_text(face = "bold"),
panel.grid.major.x = element_blank() # Remove vertical grid lines
)
# Display the trend plot
trend_plot
Figure 6: Salary Trends by Fiscal Year
# Add percentage change annotations
for (i in 2:nrow(yearly_stats)) {
# Calculate position for base salary annotation
base_y <- yearly_stats$avg_base[i]
base_change <- yearly_stats$base_pct_change[i]
# Add annotation for significant changes
if (!is.na(base_change) && abs(base_change) > 1) {
trend_plot <- trend_plot +
annotate(
"text",
x = i,
y = base_y * 1.02,
label = paste0(ifelse(base_change > 0, "+", ""), round(base_change, 1), "%"),
color = "#3498db",
size = 3.5,
fontface = "bold"
)
}
}
trend_plot
Figure 6: Salary Trends by Fiscal Year
Interpretation: - 2015 to 2017:
Average base salary increased by 12.4% (from $38,420 to
$43,201) - Regular gross pay shows a similar upward
trend - Blue line: Base salary trend - Green
line: Regular gross pay trend - Percentage
labels: Show year-over-year changes
Analysis: - The time factor cannot be ignored in salary
forecasting. If the “Fiscal Year” is not included in the model, the
model will not be able to explain the natural wage growth over time,
resulting in low forecasts of recent data.
# Create multiple subplots for dashboard
# Plot 1: Histogram with density overlay
p1 <- ggplot(df_cleaned, aes(x = `Base Salary`)) +
geom_histogram(
aes(y = ..density..), # Use density instead of count
bins = 30,
fill = "#3498db",
alpha = 0.7,
color = "white"
) +
geom_density(
color = "#2c3e50",
size = 1,
alpha = 0.5
) +
labs(
title = "Base Salary Distribution",
x = "Base Salary ($)",
y = "Density"
) +
scale_x_continuous(labels = scales::dollar) +
theme_minimal() +
theme(
plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
axis.title = element_text(size = 9)
)
# Plot 2: Boxplot by borough (simplified)
p2 <- ggplot(df_cleaned, aes(x = `Work Location Borough`, y = `Base Salary`)) +
geom_boxplot(
fill = "#e74c3c",
alpha = 0.7,
outlier.size = 0.8
) +
labs(
title = "Salary by Borough",
x = "Borough",
y = "Base Salary ($)"
) +
scale_y_continuous(
labels = scales::dollar,
limits = c(0, quantile(df_cleaned$`Base Salary`, 0.95, na.rm = TRUE)) # Remove extreme outliers
) +
theme_minimal() +
theme(
plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.title = element_text(size = 9)
)
# Plot 3: Employees by leave status
p3_data <- df_cleaned %>%
count(`Leave Status as of June 30`) %>%
mutate(percentage = n / sum(n) * 100)
p3 <- ggplot(p3_data, aes(x = reorder(`Leave Status as of June 30`, n), y = n)) +
geom_bar(
stat = "identity",
fill = "#2ecc71",
alpha = 0.7,
width = 0.6
) +
coord_flip() +
geom_text(
aes(label = paste0(round(percentage, 1), "%")),
hjust = -0.1,
size = 3
) +
labs(
title = "Employees by Leave Status",
x = "Leave Status",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
axis.title = element_text(size = 9)
)
# Plot 4: Employees by pay basis
p4_data <- df_cleaned %>%
count(`Pay Basis`) %>%
mutate(percentage = n / sum(n) * 100)
p4 <- ggplot(p4_data, aes(x = reorder(`Pay Basis`, n), y = n)) +
geom_bar(
stat = "identity",
fill = "#9b59b6",
alpha = 0.7,
width = 0.6
) +
coord_flip() +
geom_text(
aes(label = paste0(round(percentage, 1), "%")),
hjust = -0.1,
size = 3
) +
labs(
title = "Employees by Pay Basis",
x = "Pay Basis",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
axis.title = element_text(size = 9)
)
# Combine charts using patchwork
(p1 + p2) / (p3 + p4) +
plot_annotation(
title = "NYC Payroll Data Dashboard",
subtitle = "Key Insights at a Glance - Interactive View of Multiple Dimensions",
caption = "Source: NYC Citywide Payroll Data 2015-2017",
theme = theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 12, color = "gray50", hjust = 0.5),
plot.caption = element_text(size = 10, color = "gray60", hjust = 1)
)
)
Figure 7: NYC Payroll Data Dashboard
Dashboard Components: 1. Top-left: Base salary distribution with density overlay 2. Top-right: Salary distribution by borough (boxplots) 3. Bottom-left: Employee count by leave status with percentages 4. Bottom-right: Employee count by pay basis with percentages
# Create comprehensive summary statistics
complete_summary <- data.frame(
Metric = c(
"Total Employees (Records)",
"Unique Employees (by ID)",
"Number of Agencies",
"Work Locations (Boroughs)",
"Fiscal Year Range",
"Average Base Salary",
"Median Base Salary",
"Maximum Base Salary",
"Minimum Base Salary",
"Base Salary Standard Deviation",
"Total Payroll (Regular Gross Paid)",
"Average Regular Gross Paid",
"Most Common Leave Status",
"Most Common Pay Basis"
),
Value = c(
format(nrow(df_cleaned), big.mark = ","),
format(length(unique(df_cleaned$employee_id)), big.mark = ","),
length(unique(df_cleaned$`Agency Name`)),
length(unique(df_cleaned$`Work Location Borough`)),
paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to "),
paste0("$", format(round(mean(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(median(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(max(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(min(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(sd(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(sum(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
paste0("$", format(round(mean(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
names(sort(table(df_cleaned$`Leave Status as of June 30`), decreasing = TRUE))[1],
names(sort(table(df_cleaned$`Pay Basis`), decreasing = TRUE))[1]
)
)
# Display as formatted table
kable(complete_summary,
caption = "Complete Summary Statistics of NYC Payroll Data",
align = c("l", "r"),
col.names = c("Metric", "Value"))
| Metric | Value |
|---|---|
| Total Employees (Records) | 6,510 |
| Unique Employees (by ID) | 2,500 |
| Number of Agencies | 65 |
| Work Locations (Boroughs) | 8 |
| Fiscal Year Range | 2015 to 2017 |
| Average Base Salary | $41,582 |
| Median Base Salary | $39,842 |
| Maximum Base Salary | $191,170 |
| Minimum Base Salary | $1 |
| Base Salary Standard Deviation | $40,033 |
| Total Payroll (Regular Gross Paid) | $260,141,105 |
| Average Regular Gross Paid | $39,960 |
| Most Common Leave Status | active |
| Most Common Pay Basis | per annum |
# Display full session information
session_info <- sessionInfo()
cat("R Version:", session_info$R.version$version.string, "\n")
## R Version: R version 4.5.1 (2025-06-13 ucrt)
cat("Platform:", session_info$R.version$platform, "\n")
## Platform: x86_64-w64-mingw32
cat("Operating System:", session_info$running, "\n\n")
## Operating System: Windows 11 x64 (build 26200)
cat("Loaded Packages:\n")
## Loaded Packages:
packages <- data.frame(
Package = names(session_info$otherPkgs),
Version = sapply(session_info$otherPkgs, function(x) x$Version)
)
print(packages)
## Package Version
## RColorBrewer RColorBrewer 1.1-3
## glmnet glmnet 4.1-10
## Matrix Matrix 1.7-3
## ranger ranger 0.17.0
## pROC pROC 1.19.0.1
## randomForest randomForest 4.7-1.2
## nnet nnet 7.3-20
## caret caret 7.0-1
## lattice lattice 0.22-7
## kableExtra kableExtra 1.4.0
## knitr knitr 1.51
## scales scales 1.4.0
## patchwork patchwork 1.3.2
## lubridate lubridate 1.9.4
## forcats forcats 1.0.1
## stringr stringr 1.5.2
## dplyr dplyr 1.1.4
## purrr purrr 1.2.0
## readr readr 2.1.6
## tidyr tidyr 1.3.2
## tibble tibble 3.3.0
## ggplot2 ggplot2 4.0.1
## tidyverse tidyverse 2.0.0
# Check for missing values
missing_values <- colSums(is.na(df_cleaned))
missing_df <- data.frame(
Column = names(missing_values),
Missing_Values = missing_values,
Percentage = round(missing_values / nrow(df_cleaned) * 100, 2)
)
cat("Missing Values Analysis:\n")
## Missing Values Analysis:
print(missing_df[missing_df$Missing_Values > 0, ])
## Column Missing_Values Percentage
## Title Description Title Description 3 0.05
# Check for duplicates
duplicates <- sum(duplicated(df_cleaned))
cat("\nNumber of duplicate rows:", duplicates, "\n")
##
## Number of duplicate rows: 0
# Data range validation
cat("\nData Validation:\n")
##
## Data Validation:
cat("- Fiscal Year range:", min(df_cleaned$`Fiscal Year`, na.rm = TRUE), "to",
max(df_cleaned$`Fiscal Year`, na.rm = TRUE), "\n")
## - Fiscal Year range: 2015 to 2017
cat("- Base Salary range: $", min(df_cleaned$`Base Salary`, na.rm = TRUE), "to $",
max(df_cleaned$`Base Salary`, na.rm = TRUE), "\n")
## - Base Salary range: $ 1 to $ 191170.2
cat("- Data covers", length(unique(df_cleaned$`Fiscal Year`)), "fiscal years\n")
## - Data covers 3 fiscal years
Report Generated: 2026-01-13 00:48:17
Data Source: New York City - Citywide Payroll Data
(Fiscal Years 2015-2017)
Analysis Team: Data Visualization Group
Report Version: 1.0 - With Full Code and
Visualizations
This classification analysis aims to predict employee leave status (Active, On Leave, Ceased, On Separation Leave) using payroll, job, and organisational attributes from the NYC Citywide Payroll dataset.
The rationale for this analysis is to examine whether observable employment and compensation patterns can meaningfully distinguish different leave outcomes, which reflects common workforce analytics use cases such as monitoring employment stability, identifying departments with higher exit risk, and understanding structural differences in employment status.
By framing leave status as a supervised classification problem, the analysis also allows comparison of different modelling approaches and highlights the impact of class imbalance and data quality on predictive performance.
df_analysis <- read_csv("cleaned_payroll_data.csv")
# Set random seed to reproduce result in every iteration
set.seed(123)
# Convert the 'Leave Status as of June 30' column into a categorical factor variable.
# The factor levels are explicitly specified to:
# 1) Standardise the leave status categories,
# 2) Ensure consistent ordering of classes for analysis and modelling,
# 3) Prevent R from automatically inferring or reordering levels.
df_analysis$LeaveStatus <- factor(
df_analysis$`Leave Status as of June 30`,
levels = c("active", "on leave", "ceased", "on separation leave")
)
# Standardise column names by replacing spaces with underscores.
df_analysis <- df_analysis %>%
rename_with(~ gsub(" ", "_", .x))
head(df_analysis)
## # A tibble: 6 × 18
## Fiscal_Year Agency_Name Last_Name First_Name Mid_Init Agency_Start_Date
## <dbl> <chr> <chr> <chr> <chr> <date>
## 1 2016 district attorney… cohn addam j 2015-05-21
## 2 2016 district attorney… cornell jacqueline e 2011-05-16
## 3 2016 district attorney… kermani leila m 1999-09-07
## 4 2016 district attorney… sangerma… gregory m 2002-09-03
## 5 2016 district attorney… scott gainsworth unknown 2013-01-07
## 6 2016 district attorney… walton sherita m 2008-09-02
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## # Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## # Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## # Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## # LeaveStatus <fct>
# Create a modelling dataset by selecting only variables relevant to the classification task (Leave Status).
model_data <- df_analysis %>%
select(
LeaveStatus,
Work_Location_Borough,
Pay_Basis,
Base_Salary,
Regular_Hours,
Regular_Gross_Paid,
OT_Hours,
Total_OT_Paid,
Total_Other_Pay
)
# Convert selected categorical predictors into factor variables.
model_data <- model_data %>%
mutate(
across(
c(`Work_Location_Borough`, `Pay_Basis`),
as.factor
)
)
# Remove rows with missing target variable
model_data <- model_data %>%
filter(!is.na(LeaveStatus)) %>%
droplevels()
# Split the dataset into training and testing sets using stratified sampling.
train_index <- createDataPartition(
model_data$LeaveStatus,
p = 0.8,
list = FALSE
)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
# Fit a multinomial logistic regression model to predict Leave Status.
multinom_model <- multinom(
LeaveStatus ~ .,
data = train_data,
trace = FALSE
)
# Generate predicted leave status classes
multinom_pred <- predict(multinom_model, test_data)
# Evaluate model performance by comparing predicted classes
confusionMatrix(
multinom_pred,
test_data$LeaveStatus
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction active on leave ceased on separation leave
## active 1096 10 94 1
## on leave 0 0 0 0
## ceased 15 1 69 0
## on separation leave 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.9059
## 95% CI : (0.8886, 0.9213)
## No Information Rate : 0.8639
## P-Value [Acc > NIR] : 2.583e-06
##
## Kappa : 0.4909
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: active Class: on leave Class: ceased
## Sensitivity 0.9865 0.000000 0.42331
## Specificity 0.4000 1.000000 0.98575
## Pos Pred Value 0.9126 NaN 0.81176
## Neg Pred Value 0.8235 0.991446 0.92173
## Prevalence 0.8639 0.008554 0.12675
## Detection Rate 0.8523 0.000000 0.05365
## Detection Prevalence 0.9339 0.000000 0.06610
## Balanced Accuracy 0.6932 0.500000 0.70453
## Class: on separation leave
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9992224
## Prevalence 0.0007776
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
# Train a Random Forest classification model to predict Leave Status.
rf_model <- ranger(
LeaveStatus ~ .,
data = train_data,
num.trees = 300,
importance = "impurity",
probability = FALSE
)
# Generate predicted class labels for the test dataset.
rf_pred <- predict(rf_model, test_data)$predictions
# Evaluate the Random Forest model by comparing predicted classes
confusionMatrix(
as.factor(rf_pred),
test_data$LeaveStatus
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction active on leave ceased on separation leave
## active 1095 10 57 1
## on leave 0 0 0 0
## ceased 16 1 106 0
## on separation leave 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.9339
## 95% CI : (0.9189, 0.9469)
## No Information Rate : 0.8639
## P-Value [Acc > NIR] : 8.61e-16
##
## Kappa : 0.6801
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: active Class: on leave Class: ceased
## Sensitivity 0.9856 0.000000 0.65031
## Specificity 0.6114 1.000000 0.98486
## Pos Pred Value 0.9415 NaN 0.86179
## Neg Pred Value 0.8699 0.991446 0.95099
## Prevalence 0.8639 0.008554 0.12675
## Detection Rate 0.8515 0.000000 0.08243
## Detection Prevalence 0.9044 0.000000 0.09565
## Balanced Accuracy 0.7985 0.500000 0.81758
## Class: on separation leave
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9992224
## Prevalence 0.0007776
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
final_predictions <- test_data %>%
mutate(
Predicted_LeaveStatus = rf_pred
)
head(final_predictions)
## # A tibble: 6 × 10
## LeaveStatus Work_Location_Borough Pay_Basis Base_Salary Regular_Hours
## <fct> <fct> <fct> <dbl> <dbl>
## 1 active brooklyn per annum 90665 1830
## 2 active queens per annum 66500 1830
## 3 ceased queens per annum 191170. 1525
## 4 active bronx per annum 30516 1806.
## 5 active brooklyn per annum 57747 2091.
## 6 active brooklyn per annum 103585 2091.
## # ℹ 5 more variables: Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## # Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, Predicted_LeaveStatus <fct>
Both models demonstrate strong overall performance in predicting employee leave status. The multinomial logistic regression achieves an accuracy of 91%, while the random forest model improves this slightly to 94%, indicating a modest but consistent gain from using a non-linear, tree-based approach.
Examination of the confusion matrices shows that both models classify the “active” and “ceased” categories accurately, with very few misclassifications. However, neither model successfully predicts the rare classes “on leave” and “on separation leave”, which are almost never identified due to their extremely low prevalence in the dataset.
Overall, the random forest model exhibits marginally better classification performance, particularly in reducing misclassification between the active and ceased categories, but both models are heavily influenced by class imbalance when predicting less frequent leave statuses. ***
The objective of this regression analysis is to predict the annual income of New York City employees based on their job-related and employment characteristics. By building a regression model, this analysis aims to quantify how factors such as job title, base salary, agency affiliation, work hours, and tenure contribute to variations in employees’ annual income.
df <- df_analysis
dim(df)
## [1] 6510 18
head(df, 3)
## # A tibble: 3 × 18
## Fiscal_Year Agency_Name Last_Name First_Name Mid_Init Agency_Start_Date
## <dbl> <chr> <chr> <chr> <chr> <date>
## 1 2016 district attorney… cohn addam j 2015-05-21
## 2 2016 district attorney… cornell jacqueline e 2011-05-16
## 3 2016 district attorney… kermani leila m 1999-09-07
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## # Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## # Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## # Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## # LeaveStatus <fct>
str(df)
## spc_tbl_ [6,510 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Fiscal_Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency_Name : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
## $ Last_Name : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
## $ First_Name : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
## $ Mid_Init : chr [1:6510] "j" "e" "m" "m" ...
## $ Agency_Start_Date : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
## $ Work_Location_Borough : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
## $ Title_Description : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
## $ Leave_Status_as_of_June_30: chr [1:6510] "ceased" "ceased" "active" "active" ...
## $ Base_Salary : num [1:6510] 1 73000 133500 106000 60597 ...
## $ Pay_Basis : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
## $ Regular_Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ Regular_Gross_Paid : num [1:6510] 1750 31943 132415 105248 59831 ...
## $ OT_Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_OT_Paid : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Other_Pay : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
## $ employee_id : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
## $ LeaveStatus : Factor w/ 4 levels "active","on leave",..: 3 3 1 1 1 4 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. `Fiscal Year` = col_double(),
## .. `Agency Name` = col_character(),
## .. `Last Name` = col_character(),
## .. `First Name` = col_character(),
## .. `Mid Init` = col_character(),
## .. `Agency Start Date` = col_date(format = ""),
## .. `Work Location Borough` = col_character(),
## .. `Title Description` = col_character(),
## .. `Leave Status as of June 30` = col_character(),
## .. `Base Salary` = col_double(),
## .. `Pay Basis` = col_character(),
## .. `Regular Hours` = col_double(),
## .. `Regular Gross Paid` = col_double(),
## .. `OT Hours` = col_double(),
## .. `Total OT Paid` = col_double(),
## .. `Total Other Pay` = col_double(),
## .. employee_id = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
In this study, the target variable is defined as annual income, representing the total compensation an employee receives within a fiscal year. Annual income is calculated as the sum of three components: Regular Gross Paid, Total OT Paid, and Total Other Pay.
Since these components are used to construct the target variable, they are excluded from the set of independent variables in the regression model to prevent data leakage and ensure a fair evaluation of model performance.
# Construct target variable: Annual Income,
#Define the target variable (annual income) as the sum of all income components
df$annual_income <- df$Regular_Gross_Paid + df$Total_OT_Paid + df$Total_Other_Pay
# Examine the distribution of the target variable
summary(df$annual_income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2836 34943 45596 77180 282416
# Convert agency start date to Date format
df$Agency_Start_Date <- as.Date(df$Agency_Start_Date)
# Use fiscal year end (June 30) as the reference point
fy_end <- as.Date(paste0(df$Fiscal_Year, "-06-30"))
## Calculate employee tenure in years
df$tenure_years <- as.numeric(fy_end - df$Agency_Start_Date) / 365.25
# Inspect tenure distribution
summary(df$tenure_years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1561 2.8836 9.5797 11.4987 17.4976 51.4059
# Define target variable
y <- df$annual_income
The selection of explanatory variables is based on the cleaned payroll dataset and includes attributes that are expected to influence employee income. These variables include fiscal year, agency name, work location borough, job title, leave status, base salary, pay basis, regular working hours, and overtime hours.
In addition, an engineered feature, employee tenure (tenure_years), is created using the employee’s agency start date and the fiscal year end date. This variable captures the length of service and reflects an employee’s career stage within the organization.
To facilitate regression modeling, all categorical variables are converted into factor variables, allowing them to be automatically encoded as dummy variables during model fitting.
# Specify columns to exclude
# - Target variable
# - Income components (to avoid data leakage)
# - ID and name-related variables
# - Raw start date (already transformed into tenure_years)
drop_cols <- c(
"annual_income",
"Regular_Gross_Paid", "Total_OT_Paid", "Total_Other_Pay",
"First_Name", "Last_Name", "Mid_Init",
"Agency_Start_Date",
"employee_id"
)
# Select predictor variables
keep_cols <- setdiff(names(df), drop_cols)
X <- df[, keep_cols]
# Convert character variables to factors for regression modeling
char_cols <- sapply(X, is.character)
X[char_cols] <- lapply(X[char_cols], as.factor)
# Check structure of predictors
str(X)
## tibble [6,510 × 11] (S3: tbl_df/tbl/data.frame)
## $ Fiscal_Year : num [1:6510] 2016 2016 2016 2016 2016 ...
## $ Agency_Name : Factor w/ 65 levels "admin for children's svcs",..: 43 43 43 43 43 43 45 46 46 46 ...
## $ Work_Location_Borough : Factor w/ 8 levels "bronx","brooklyn",..: 3 3 3 3 3 3 2 5 5 5 ...
## $ Title_Description : Factor w/ 334 levels "12 month special education asst. principal",..: 104 39 39 39 107 39 39 39 39 39 ...
## $ Leave_Status_as_of_June_30: Factor w/ 5 levels "active","ceased",..: 2 2 1 1 1 4 1 1 1 1 ...
## $ Base_Salary : num [1:6510] 1 73000 133500 106000 60597 ...
## $ Pay_Basis : Factor w/ 4 levels "per annum","per day",..: 3 1 1 1 1 1 1 1 1 1 ...
## $ Regular_Hours : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
## $ OT_Hours : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
## $ LeaveStatus : Factor w/ 4 levels "active","on leave",..: 3 3 1 1 1 4 1 1 1 1 ...
## $ tenure_years : num [1:6510] 1.11 5.13 16.81 13.82 3.48 ...
The dataset is randomly split into a training set (80%) and a testing set (20%) to evaluate the model’s generalization performance.
As a benchmark, a baseline model is constructed by predicting the mean annual income of the training data for all observations in the test set. The baseline results indicate high prediction error and an R² value close to zero, demonstrating that this simple approach fails to explain variations in employee income and serves only as a reference for comparison.
set.seed(42)
# Total number of observations
n <- nrow(df)
# Randomly sample 80% of data for training
idx <- sample(seq_len(n), size = floor(0.8 * n))
# Create training and testing sets
X_train <- X[idx, ]
X_test <- X[-idx, ]
y_train <- y[idx]
y_test <- y[-idx]
# Define evaluation metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2, na.rm = TRUE))
mae <- function(actual, pred) mean(abs(actual - pred), na.rm = TRUE)
r2 <- function(actual, pred) {
ss_res <- sum((actual - pred)^2, na.rm = TRUE)
ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
1 - ss_res/ss_tot
}
# Baseline prediction using training mean
base_pred <- rep(mean(y_train, na.rm = TRUE), length(y_test))
# Evaluate baseline model
cat("Baseline:\n")
## Baseline:
cat("RMSE:", rmse(y_test, base_pred), "\n")
## RMSE: 46112.47
cat("MAE :", mae(y_test, base_pred), "\n")
## MAE : 38827.99
cat("R2 :", r2(y_test, base_pred), "\n")
## R2 : -1.912583e-05
To improve predictive performance and address potential multicollinearity among predictors, a Ridge regression model is employed. Given the right-skewed distribution of income, the target variable is log-transformed using log(1 + income) prior to model training.
The optimal regularization parameter is selected using 5-fold cross-validation. The Ridge regression model significantly outperforms the baseline model, achieving substantially lower RMSE and MAE values and a much higher R², indicating strong explanatory and predictive capability.
# Combine train and test data for consistent preprocessing
X_all <- rbind(X_train, X_test)
fill_missing <- function(d){
for (nm in names(d)) {
if (is.numeric(d[[nm]]) || is.integer(d[[nm]])) {
med <- median(d[[nm]], na.rm = TRUE)
d[[nm]][is.na(d[[nm]])] <- med
} else {
d[[nm]] <- as.factor(d[[nm]])
d[[nm]] <- as.character(d[[nm]])
d[[nm]][is.na(d[[nm]])] <- "Missing"
d[[nm]] <- as.factor(d[[nm]])
}
}
d
}
# Apply missing value handling
X_all2 <- fill_missing(X_all)
X_train2 <- X_all2[1:nrow(X_train), ]
X_test2 <- X_all2[(nrow(X_train)+1):nrow(X_all2), ]
# Create model matrices (one-hot encoding)
mm_train <- model.matrix(~ . - 1, data = X_train2)
mm_test <- model.matrix(~ . - 1, data = X_test2)
# Perform cross-validation to select optimal lambda
set.seed(42)
cv_ridge <- cv.glmnet(
x = mm_train,
y = log1p(y_train),
alpha = 0, # Ridge
nfolds = 5
)
best_lambda <- cv_ridge$lambda.min
best_lambda
## [1] 0.1508895
plot(cv_ridge)
# Train final Ridge regression model
ridge_fit <- glmnet(
x = mm_train,
y = log1p(y_train),
alpha = 0,
lambda = best_lambda
)
# Generate predictions and transform back to original scale
pred_log <- predict(ridge_fit, newx = mm_test)
pred <- expm1(as.numeric(pred_log))
# Evaluate Ridge regression model
cat("Ridge (log-target):\n")
## Ridge (log-target):
cat("RMSE:", rmse(y_test, pred), "\n")
## RMSE: 18389.07
cat("MAE :", mae(y_test, pred), "\n")
## MAE : 9914.029
cat("R2 :", r2(y_test, pred), "\n")
## R2 : 0.8409656
To enhance interpretability, the coefficients of the Ridge regression model are examined. By ranking predictors based on the absolute magnitude of their coefficients, the most influential factors affecting annual income can be identified.
The results show that job title (Title Description) is the dominant predictor, reflecting the structured nature of compensation within the New York City government. Other important predictors include base salary, agency affiliation, and pay basis. The direction and magnitude of the coefficients are consistent with real-world salary structures, supporting the validity of the model.
# Extract model coefficients
coef_vec <- as.matrix(coef(ridge_fit))
coef_df <- data.frame(
feature = rownames(coef_vec),
coef = as.numeric(coef_vec)
)
# Remove intercept and compute absolute values
coef_df <- coef_df[coef_df$feature != "(Intercept)", ]
coef_df$abs_coef <- abs(coef_df$coef)
# Display top 20 most influential predictors
top20 <- coef_df[order(-coef_df$abs_coef), ][1:20, ]
top20
## feature coef
## 285 Title_Descriptionnon-teaching adjunct ii -3.413249
## 348 Title_Descriptionsewage treatment worker -3.336435
## 369 Title_Descriptionsuperintendent of water and sewer systems 2.697947
## 148 Title_Descriptioncall center representative -2.687453
## 364 Title_Descriptionsubstitute school aide -2.631358
## 166 Title_Descriptioncity planner 2.441522
## 338 Title_Descriptionsenior occupational therapist 2.413147
## 382 Title_Descriptionsupervisor of school security 2.275472
## 215 Title_Descriptionelectrician 2.275146
## 344 Title_Descriptionsergeant d/a special assignment 2.247576
## 350 Title_Descriptionsocial worker -2.223493
## 77 Title_Descriptionadjunct college lab tech -2.170581
## 24 Agency_Namedepartment of city planning -2.144856
## 170 Title_Descriptioncity service aide -2.140913
## 79 Title_Descriptionadjunct professor -2.118765
## 86 Title_Descriptionadministrative accountant 2.078307
## 255 Title_Descriptionhousing caretaker 2.050431
## 359 Title_Descriptionstationary engineer 2.044397
## 286 Title_Descriptionnon-teaching adjunct iii -2.013277
## 92 Title_Descriptionadministrative education officer 2.003008
## abs_coef
## 285 3.413249
## 348 3.336435
## 369 2.697947
## 148 2.687453
## 364 2.631358
## 166 2.441522
## 338 2.413147
## 382 2.275472
## 215 2.275146
## 344 2.247576
## 350 2.223493
## 77 2.170581
## 24 2.144856
## 170 2.140913
## 79 2.118765
## 86 2.078307
## 255 2.050431
## 359 2.044397
## 286 2.013277
## 92 2.003008
An error analysis is conducted to investigate cases with the largest prediction errors. The analysis reveals that higher errors tend to occur in specific job roles or agencies where income structures are more complex or less consistent.
These discrepancies may result from irregular payments, partial-year employment, or special allowances that are not fully captured by the available predictors. Despite these limitations, the model demonstrates strong overall performance, and such errors are expected in real-world payroll data.
# Create error analysis dataset
res <- data.frame(
actual = y_test,
pred = pred
)
# Compute prediction errors
res$error <- res$pred - res$actual
res$abs_error <- abs(res$error)
# Inspect cases with the largest absolute errors
head(res[order(-res$abs_error), ], 10)
## actual pred error abs_error
## 1171 85721.48 330747.574 245026.09 245026.09
## 350 167583.17 36593.827 -130989.34 130989.34
## 945 92418.76 198446.096 106027.34 106027.34
## 817 136280.46 32067.900 -104212.56 104212.56
## 363 104492.25 1686.626 -102805.62 102805.62
## 202 122551.33 21411.060 -101140.27 101140.27
## 1273 109184.78 14160.450 -95024.33 95024.33
## 1193 196161.99 289882.386 93720.40 93720.40
## 349 132518.37 39022.946 -93495.42 93495.42
## 1202 82484.88 173651.568 91166.69 91166.69
summary(res$abs_error)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.030e-01 1.244e+03 4.956e+03 9.914e+03 1.292e+04 2.450e+05
# Combine errors with original test features for interpretation
res2 <- cbind(res, X_test2)
head(res2[order(-res2$abs_error), c("actual","pred","abs_error","Title_Description","Agency_Name")], 10)
## actual pred abs_error Title_Description
## 1171 85721.48 330747.574 245026.09 certified it administrator
## 350 167583.17 36593.827 130989.34 custodian engineer
## 945 92418.76 198446.096 106027.34 agency attorney
## 817 136280.46 32067.900 104212.56 custodian engineer
## 363 104492.25 1686.626 102805.62 firefighter
## 202 122551.33 21411.060 101140.27 principal
## 1273 109184.78 14160.450 95024.33 police officer
## 1193 196161.99 289882.386 93720.40 captain
## 349 132518.37 39022.946 93495.42 custodian engineer
## 1202 82484.88 173651.568 91166.69 agency attorney
## Agency_Name
## 1171 dept of parks & recreation
## 350 doe custodial payrol
## 945 department of education admin
## 817 doe custodial payrol
## 363 fire department
## 202 dept of ed pedagogical
## 1273 police department
## 1193 fire department
## 349 doe custodial payrol
## 1202 housing preservation & dvlpmnt
The regression results demonstrate that the Ridge regression model provides a strong and reliable prediction of employee annual income. Compared to the baseline model, the Ridge model achieves a substantially higher R² value and significantly lower prediction errors, indicating that the selected predictors capture most of the meaningful variation in employee compensation.
The dominance of job title variables among the most influential predictors highlights the structured nature of salary determination within the New York City government. Since compensation is largely determined by job classification, role responsibility, and union or pay grade systems, it is reasonable that job title plays a central role in predicting income. Base salary, agency affiliation, and pay basis further refine these predictions by capturing institutional and contractual differences across departments.
The use of a log-transformed target variable improves model stability by reducing the influence of extreme income values. This transformation allows the model to better capture relative income differences across employees, particularly in a dataset with a right-skewed income distribution.
Error analysis reveals that the largest prediction errors tend to occur in specific job roles or agencies with complex or irregular compensation structures. In such cases, actual income may be influenced by partial-year employment, special allowances, or atypical payment arrangements that are not fully represented by the available features. These findings suggest that while the model performs well overall, incorporating additional variables related to employment duration or special compensation types could further improve accuracy.
Overall, the results indicate that the Ridge regression model is well-suited for modeling employee income in this context and provides meaningful insights into the factors driving salary variation.
The regression analysis successfully models the annual income of New York City employees using payroll and job-related data. By comparing a baseline approach with a Ridge regression model, the analysis demonstrates that regularized regression significantly improves predictive performance and explanatory power.
The classification analysis also creates a reliable and robust model, with high accuracies on predicting employee leave status, which helps in risk management for human resources.
With the success of the analyses, we hope that the project will set a precedence for future work to look for feasibilities in extending the analysis onto local markets to evaluate the generalizability. ***