WQD7004 Group Project - Group 11

Team Members

Chong Zhan Hang (24077643)

Shi Yaqi (24227144)

Long Yun (24213952)

Liang Shengbao (24200736)

Huang Xinyi (25058110)

Introduction

This project applies data science techniques to a large-scale, real-world administrative dataset to demonstrate data cleaning, exploratory analysis, and predictive modelling using R.

The NYC Citywide Payroll dataset was selected due to its size, complexity, and inherent data quality issues, which closely reflect practical challenges encountered in organisational and public-sector analytics.

The dataset contains detailed employee, payroll, and job-related information across multiple city agencies, making it well suited for both regression and classification problems.

Through this project, the group aims to transform raw, noisy payroll data into a structured analytical dataset and extract meaningful insights related to employee compensation and employment status.

Project Objectives

The objectives of this project are as follows:

To explore and understand the structure, content, and limitations of the NYC Citywide Payroll dataset through systematic data inspection.
To perform comprehensive data cleaning and preprocessing, including handling missing values, standardising categorical variables, and resolving inconsistencies in administrative records.
To present findings through clear visualisation, model evaluation metrics, and concise interpretation suitable for decision-making contexts.
To develop a regression model to analyse factors influencing employee compensation using payroll and job-related attributes.
To build and evaluate classification models to predict employee leave status based on employment and compensation characteristics.
To compare the performance of different modelling approaches.

Data Cleaning

1. Data Loading and Fiscal Year Filtering

This section covers loading the raw CSV data and filtering it to include only records from the latest three consecutive fiscal years. Load the original CSV dataset and filter for records from the latest three consecutive fiscal years.

R Code

library(readr)
library(dplyr)

# Load the dataset
df <- read_csv('Citywide_Payroll_Data__Fiscal_Year_.csv', show_col_types = FALSE)

# Identify and filter for the latest three fiscal years
unique_fiscal_years <- unique(df$`Fiscal Year`)
sorted_fiscal_years <- sort(unique_fiscal_years, decreasing = TRUE)
latest_three_fiscal_years <- head(sorted_fiscal_years, 3)
df_filtered <- df %>% filter(`Fiscal Year` %in% latest_three_fiscal_years)

cat(paste("Original data loaded and filtered for latest three fiscal years:", toString(latest_three_fiscal_years), "\n"))

## Original data loaded and filtered for latest three fiscal years: 2017, 2016, 2015

head(df_filtered)

## # A tibble: 6 × 16
##   `Fiscal Year` `Agency Name`               `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                       <chr>       <chr>        <chr>     
## 1          2016 DISTRICT ATTORNEY-MANHATTAN ABA'AHMID   RAHASHEEM    E         
## 2          2016 DISTRICT ATTORNEY-MANHATTAN ABENSUR     MARGARET     <NA>      
## 3          2016 DISTRICT ATTORNEY-MANHATTAN ABOUNAOUM   ANDREA       L         
## 4          2016 DISTRICT ATTORNEY-MANHATTAN ABRAHAM     JONATHAN     J         
## 5          2016 DISTRICT ATTORNEY-MANHATTAN ABRAMS      JOSEPH       <NA>      
## 6          2016 DISTRICT ATTORNEY-MANHATTAN ABREU       JENNIFER     <NA>      
## # ℹ 11 more variables: `Agency Start Date` <chr>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## #   `Total OT Paid` <chr>, `Total Other Pay` <chr>

Explanation First, the necessary libraries readr and dplyr are loaded. The Citywide_Payroll_Data__Fiscal_Year_.csv file is then loaded into a dataframe named df. To identify the latest three fiscal years, unique fiscal years are extracted, sorted in descending order, and the top three are selected. Finally, df_filtered is created by filtering the original df to include only records corresponding to these three latest fiscal years. This ensures that our analysis focuses on the most recent data.

2. Unique Employee ID Creation and Sampling

This section details the creation of unique employee identifiers and the subsequent structured sampling of the dataset. Create an approximately unique employee_id by combining ‘Last Name’, ‘First Name’, and ‘Agency Start Date’. Then, randomly sample 2500 unique employee_ids from the filtered data.

R Code

df_filtered <- df_filtered %>%
  mutate(
    `Last Name` = ifelse(is.na(`Last Name`), "", as.character(`Last Name`)),
    `First Name` = ifelse(is.na(`First Name`), "", as.character(`First Name`)),
    `Agency Start Date` = ifelse(is.na(`Agency Start Date`), "", as.character(`Agency Start Date`)),
    employee_id = paste0(`Last Name`, `First Name`, `Agency Start Date`)
  )

unique_employee_ids <- unique(df_filtered$employee_id)
sampled_employee_ids <- sample(unique_employee_ids, min(2500, length(unique_employee_ids)))
df_sampled <- df_filtered %>% filter(employee_id %in% sampled_employee_ids)

cat(paste("Number of unique employee IDs created:", length(unique_employee_ids), "\n"))

## Number of unique employee IDs created: 642096

cat(paste("Number of employee IDs sampled:", length(sampled_employee_ids), "\n"))

## Number of employee IDs sampled: 2500

cat(paste("Number of records after sampling:", nrow(df_sampled), "\n"))

## Number of records after sampling: 6510

head(df_sampled)

## # A tibble: 6 × 17
##   `Fiscal Year` `Agency Name`               `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                       <chr>       <chr>        <chr>     
## 1          2016 DISTRICT ATTORNEY-MANHATTAN COHN        ADDAM        J         
## 2          2016 DISTRICT ATTORNEY-MANHATTAN CORNELL     JACQUELINE   E         
## 3          2016 DISTRICT ATTORNEY-MANHATTAN KERMANI     LEILA        M         
## 4          2016 DISTRICT ATTORNEY-MANHATTAN SANGERMANO  GREGORY      M         
## 5          2016 DISTRICT ATTORNEY-MANHATTAN SCOTT       GAINSWORTH   <NA>      
## 6          2016 DISTRICT ATTORNEY-MANHATTAN WALTON      SHERITA      M         
## # ℹ 12 more variables: `Agency Start Date` <chr>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## #   `Total OT Paid` <chr>, `Total Other Pay` <chr>, employee_id <chr>

Explanation An employee_id is created by concatenating Last Name, First Name, and Agency Start Date to provide a pseudo-unique identifier for each employee. Missing values in these columns are replaced with empty strings before concatenation to avoid NA values in employee_id. From the df_filtered dataset, all unique employee_ids are extracted, and a random sample of 2500 (or fewer if less than 2500 unique IDs exist) is selected. Finally, df_sampled is created by filtering df_filtered to include only records belonging to these sampled employee_ids, ensuring a representative subset for further cleaning and analysis.

3. Initial Data Overview

This section provides a preliminary overview of the sampled dataset. Perform an initial overview of df_sampled using R functions like str(), dim(), and summary() to check row count, column types, and basic statistics.

R Code

str(df_sampled)

## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6510] "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" ...
##  $ Last Name                 : chr [1:6510] "COHN" "CORNELL" "KERMANI" "SANGERMANO" ...
##  $ First Name                : chr [1:6510] "ADDAM" "JACQUELINE" "LEILA" "GREGORY" ...
##  $ Mid Init                  : chr [1:6510] "J" "E" "M" "M" ...
##  $ Agency Start Date         : chr [1:6510] "05/21/2015" "05/16/2011" "09/07/1999" "09/03/2002" ...
##  $ Work Location Borough     : chr [1:6510] "MANHATTAN" "MANHATTAN" "MANHATTAN" "MANHATTAN" ...
##  $ Title Description         : chr [1:6510] "COLLEGE AIDE" "ASSISTANT DISTRICT ATTORNEY" "ASSISTANT DISTRICT ATTORNEY" "ASSISTANT DISTRICT ATTORNEY" ...
##  $ Leave Status as of June 30: chr [1:6510] "CEASED" "CEASED" "ACTIVE" "ACTIVE" ...
##  $ Base Salary               : chr [1:6510] "$1.00" "$73000.00" "$133500.00" "$106000.00" ...
##  $ Pay Basis                 : chr [1:6510] "per Hour" "per Annum" "per Annum" "per Annum" ...
##  $ Regular Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ Regular Gross Paid        : chr [1:6510] "$1750.00" "$31942.53" "$132414.94" "$105247.84" ...
##  $ OT Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total OT Paid             : chr [1:6510] "$0.00" "$0.00" "$0.00" "$0.00" ...
##  $ Total Other Pay           : chr [1:6510] "$0.00" "$0.00" "$0.00" "$0.00" ...
##  $ employee_id               : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...

dim(df_sampled)

## [1] 6510   17

summary(df_sampled)

##   Fiscal Year   Agency Name         Last Name          First Name       
##  Min.   :2015   Length:6510        Length:6510        Length:6510       
##  1st Qu.:2015   Class :character   Class :character   Class :character  
##  Median :2016   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2017                                                           
##  Max.   :2017                                                           
##    Mid Init         Agency Start Date  Work Location Borough Title Description 
##  Length:6510        Length:6510        Length:6510           Length:6510       
##  Class :character   Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character   Mode  :character      Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##  Leave Status as of June 30 Base Salary         Pay Basis        
##  Length:6510                Length:6510        Length:6510       
##  Class :character           Class :character   Class :character  
##  Mode  :character           Mode  :character   Mode  :character  
##                                                                  
##                                                                  
##                                                                  
##  Regular Hours    Regular Gross Paid    OT Hours       Total OT Paid     
##  Min.   :   0.0   Length:6510        Min.   :   0.00   Length:6510       
##  1st Qu.:   0.0   Class :character   1st Qu.:   0.00   Class :character  
##  Median :   0.0   Mode  :character   Median :   0.00   Mode  :character  
##  Mean   : 654.4                      Mean   :  61.61                     
##  3rd Qu.:1825.0                      3rd Qu.:   1.25                     
##  Max.   :2223.4                      Max.   :1150.50                     
##  Total Other Pay    employee_id       
##  Length:6510        Length:6510       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Explanation To understand the structure and basic characteristics of the df_sampled dataframe, str() is used to inspect the data types of each column, dim() provides the dimensions (number of rows and columns), and summary() generates descriptive statistics for each column, including min, max, mean, median, and quartile values for numerical columns, and frequency counts for categorical ones. This step is crucial for identifying potential issues like incorrect data types or initial data distributions.

4. Type Conversion and Standardization

This section details the conversion of data types and standardization of text fields. Convert currency/pay columns to numeric, date columns to date type, handle negative values in salary/pay columns by setting them to 0, and standardize categorical text columns.

R Code

df_cleaned <- df_sampled %>%
  mutate(
    # Convert currency columns to numeric
    `Base Salary` = as.numeric(gsub("\\$|,", "", `Base Salary`)),
    `Regular Gross Paid` = as.numeric(gsub("\\$|,|", "", `Regular Gross Paid`)),
    `Total OT Paid` = as.numeric(gsub("\\$|,|", "", `Total OT Paid`)),
    `Total Other Pay` = as.numeric(gsub("\\$|,|", "", `Total Other Pay`)),

    # Convert Agency Start Date to date type
    `Agency Start Date` = as.Date(`Agency Start Date`, format = "%m/%d/%Y")
  ) %>%
  mutate(
    # Replace negative values with 0 in salary/pay columns and Regular Hours
    `Base Salary` = ifelse(`Base Salary` < 0, 0, `Base Salary`),
    `Regular Gross Paid` = ifelse(`Regular Gross Paid` < 0, 0, `Regular Gross Paid`),
    `Total OT Paid` = ifelse(`Total OT Paid` < 0, 0, `Total OT Paid`),
    `Total Other Pay` = ifelse(`Total Other Pay` < 0, 0, `Total Other Pay`),
    `Regular Hours` = ifelse(`Regular Hours` < 0, 0, `Regular Hours`)
  ) %>%
  mutate(
    # Standardize categorical text columns
    `Agency Name` = tolower(trimws(`Agency Name`)),
    `Last Name` = tolower(trimws(`Last Name`)),
    `First Name` = tolower(trimws(`First Name`)),
    `Mid Init` = tolower(trimws(`Mid Init`)),
    `Work Location Borough` = tolower(trimws(`Work Location Borough`)),
    `Title Description` = tolower(trimws(`Title Description`)),
    `Pay Basis` = tolower(trimws(`Pay Basis`)),
    `Leave Status as of June 30` = tolower(trimws(`Leave Status as of June 30`))
  )
cat("Data types converted and text columns standardized.\n")

## Data types converted and text columns standardized.

str(df_cleaned)

## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
##  $ Last Name                 : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
##  $ First Name                : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
##  $ Mid Init                  : chr [1:6510] "j" "e" "m" "m" ...
##  $ Agency Start Date         : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
##  $ Work Location Borough     : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title Description         : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
##  $ Leave Status as of June 30: chr [1:6510] "ceased" "ceased" "active" "active" ...
##  $ Base Salary               : num [1:6510] 1 73000 133500 106000 60597 ...
##  $ Pay Basis                 : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
##  $ Regular Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ Regular Gross Paid        : num [1:6510] 1750 31943 132415 105248 59831 ...
##  $ OT Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total OT Paid             : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total Other Pay           : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
##  $ employee_id               : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...

summary(df_cleaned)

##   Fiscal Year   Agency Name         Last Name          First Name       
##  Min.   :2015   Length:6510        Length:6510        Length:6510       
##  1st Qu.:2015   Class :character   Class :character   Class :character  
##  Median :2016   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2017                                                           
##  Max.   :2017                                                           
##    Mid Init         Agency Start Date    Work Location Borough
##  Length:6510        Min.   :1964-02-02   Length:6510          
##  Class :character   1st Qu.:1999-01-19   Class :character     
##  Mode  :character   Median :2006-11-23   Mode  :character     
##                     Mean   :2004-12-24                        
##                     3rd Qu.:2013-07-09                        
##                     Max.   :2017-06-19                        
##  Title Description  Leave Status as of June 30  Base Salary       
##  Length:6510        Length:6510                Min.   :     1.00  
##  Class :character   Class :character           1st Qu.:    33.18  
##  Mode  :character   Mode  :character           Median : 39842.00  
##                                                Mean   : 41585.31  
##                                                3rd Qu.: 76488.00  
##                                                Max.   :204062.00  
##   Pay Basis         Regular Hours    Regular Gross Paid    OT Hours      
##  Length:6510        Min.   :   0.0   Min.   :     0     Min.   :   0.00  
##  Class :character   1st Qu.:   0.0   1st Qu.:  2795     1st Qu.:   0.00  
##  Mode  :character   Median :   0.0   Median : 33306     Median :   0.00  
##                     Mean   : 654.4   Mean   : 39960     Mean   :  61.61  
##                     3rd Qu.:1825.0   3rd Qu.: 70834     3rd Qu.:   1.25  
##                     Max.   :2223.4   Max.   :241863     Max.   :1150.50  
##  Total OT Paid     Total Other Pay employee_id       
##  Min.   :    0.0   Min.   :    0   Length:6510       
##  1st Qu.:    0.0   1st Qu.:    0   Class :character  
##  Median :    0.0   Median :    0   Mode  :character  
##  Mean   : 3422.1   Mean   : 2213                     
##  3rd Qu.:  199.6   3rd Qu.: 1229                     
##  Max.   :87733.1   Max.   :51416

Explanation This section converts critical columns to their appropriate data types and standardizes text. Currency symbols ($) and commas (,) are removed from financial columns (Base Salary, Regular Gross Paid, Total OT Paid, Total Other Pay), and these columns are converted to numeric type. The Agency Start Date column is converted from character to date format. Negative values in Base Salary, Regular Gross Paid, Total OT Paid, Total Other Pay, and Regular Hours are logically set to 0. Finally, various categorical text columns (e.g., Agency Name, Last Name, First Name) are converted to lowercase and leading/trailing whitespace is removed to ensure consistency and facilitate future analysis.

5. Missing Value Handling

This section addresses missing values in specific columns of the dataset. Fill missing values in ‘Work Location Borough’ using the mode, and fill missing values (including empty strings) in ‘Mid Init’, ‘Last Name’, and ‘First Name’ with ‘unknown’.

R Code

get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode_work_location_borough <- get_mode(df_cleaned$`Work Location Borough`)
df_cleaned <- df_cleaned %>%
  mutate(`Work Location Borough` = ifelse(is.na(`Work Location Borough`), mode_work_location_borough, `Work Location Borough`))

df_cleaned <- df_cleaned %>%
  mutate(
    `Mid Init` = ifelse(is.na(`Mid Init`) | `Mid Init` == "", "unknown", `Mid Init`),
    `Last Name` = ifelse(is.na(`Last Name`) | `Last Name` == "", "unknown", `Last Name`),
    `First Name` = ifelse(is.na(`First Name`) | `First Name` == "", "unknown", `First Name`)
  )
cat("Missing values in specified columns have been handled.\n")

## Missing values in specified columns have been handled.

cat("NA count in Work Location Borough: ", sum(is.na(df_cleaned$`Work Location Borough`)), "\n")

## NA count in Work Location Borough:  0

cat("Empty string count in Work Location Borough: ", sum(df_cleaned$`Work Location Borough` == ""), "\n")

## Empty string count in Work Location Borough:  0

cat("NA count in Mid Init: ", sum(is.na(df_cleaned$`Mid Init`)), "\n")

## NA count in Mid Init:  0

cat("Empty string count in Mid Init: ", sum(df_cleaned$`Mid Init` == ""), "\n")

## Empty string count in Mid Init:  0

cat("NA count in Last Name: ", sum(is.na(df_cleaned$`Last Name`)), "\n")

## NA count in Last Name:  0

cat("Empty string count in Last Name: ", sum(df_cleaned$`Last Name` == ""), "\n")

## Empty string count in Last Name:  0

cat("NA count in First Name: ", sum(is.na(df_cleaned$`First Name`)), "\n")

## NA count in First Name:  0

cat("Empty string count in First Name: ", sum(df_cleaned$`First Name` == ""), "\n")

## Empty string count in First Name:  0

Explanation Missing values are handled to ensure data completeness. For Work Location Borough, a custom get_mode function is defined to calculate the mode (most frequent value), which is then used to impute any NA values in this column. For Mid Init, Last Name, and First Name, both NA values and empty strings are replaced with the placeholder "unknown". This comprehensive imputation ensures no critical categorical data points are left missing.

6. Outlier Treatment

This section describes the identification and treatment of outliers in numerical columns. Apply IQR capping to ‘Base Salary’ while explicitly skipping ‘OT Hours’, ‘Total OT Paid’, and ‘Total Other Pay’.

R Code

cap_outliers_iqr <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  upper_bound <- q3 + 1.5 * iqr
  lower_bound <- q1 - 1.5 * iqr
  x[x > upper_bound] <- upper_bound
  x[x < lower_bound] <- lower_bound
  return(x)
}

cat("Summary of Base Salary BEFORE IQR Capping:\n")

## Summary of Base Salary BEFORE IQR Capping:

print(summary(df_cleaned$`Base Salary`))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00     33.18  39842.00  41585.31  76488.00 204062.00

df_cleaned$`Base Salary` <- cap_outliers_iqr(df_cleaned$`Base Salary`)
cat("\nSummary of Base Salary AFTER IQR Capping:\n")

## 
## Summary of Base Salary AFTER IQR Capping:

print(summary(df_cleaned$`Base Salary`))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00     33.18  39842.00  41582.11  76488.00 191170.23

cat("IQR capping applied to 'Base Salary'. Other specified columns were skipped.\n")

## IQR capping applied to 'Base Salary'. Other specified columns were skipped.

Explanation Outliers in Base Salary were capped using the Interquartile Range (IQR) method. Values falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were replaced with the respective bounds. This approach mitigates the impact of extreme values on statistical analyses. Other potentially skewed financial columns, such as OT Hours, Total OT Paid, and Total Other Pay, were intentionally excluded from capping. This decision was made because high values in these categories might represent legitimate compensation (e.g., extensive overtime or bonuses) and blanket capping could distort their true variability, which is best handled during the analysis phase (e.g., via log transformations) to preserve information.

7. Duplicate Removal and Factor Conversion

This section focuses on ensuring data uniqueness and converting categorical variables to factor types. Remove all duplicate rows from df_cleaned and convert ‘Agency Name’, ‘Work Location Borough’, ‘Leave Status as of June 30’, ‘Pay Basis’, and ‘Title Description’ to factor types.

R Code

cat(paste("Number of records BEFORE removing duplicates:", nrow(df_cleaned), "\n"))

## Number of records BEFORE removing duplicates: 6510

df_cleaned <- df_cleaned %>%
    distinct()
cat(paste("Number of records AFTER removing duplicates:", nrow(df_cleaned), "\n"))

## Number of records AFTER removing duplicates: 6510

df_cleaned <- df_cleaned %>%
  mutate(
    `Agency Name` = as.factor(`Agency Name`),
    `Work Location Borough` = as.factor(`Work Location Borough`),
    `Leave Status as of June 30` = as.factor(`Leave Status as of June 30`),
    `Pay Basis` = as.factor(`Pay Basis`),
    `Title Description` = as.factor(`Title Description`)
  )
cat("Duplicate rows removed and specified columns converted to factor types.\n")

## Duplicate rows removed and specified columns converted to factor types.

str(df_cleaned)

## tibble [6,510 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : Factor w/ 65 levels "admin for children's svcs",..: 43 43 43 43 43 43 45 46 46 46 ...
##  $ Last Name                 : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
##  $ First Name                : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
##  $ Mid Init                  : chr [1:6510] "j" "e" "m" "m" ...
##  $ Agency Start Date         : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
##  $ Work Location Borough     : Factor w/ 8 levels "bronx","brooklyn",..: 3 3 3 3 3 3 2 5 5 5 ...
##  $ Title Description         : Factor w/ 334 levels "12 month special education asst. principal",..: 104 39 39 39 107 39 39 39 39 39 ...
##  $ Leave Status as of June 30: Factor w/ 5 levels "active","ceased",..: 2 2 1 1 1 4 1 1 1 1 ...
##  $ Base Salary               : num [1:6510] 1 73000 133500 106000 60597 ...
##  $ Pay Basis                 : Factor w/ 4 levels "per annum","per day",..: 3 1 1 1 1 1 1 1 1 1 ...
##  $ Regular Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ Regular Gross Paid        : num [1:6510] 1750 31943 132415 105248 59831 ...
##  $ OT Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total OT Paid             : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total Other Pay           : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
##  $ employee_id               : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...

Explanation To maintain data integrity and prevent biased analyses, all completely duplicate rows were removed from df_cleaned. Following this, key categorical variables such as Agency Name, Work Location Borough, Leave Status as of June 30, Pay Basis, and Title Description were explicitly converted to R’s factor type. This conversion optimizes memory usage for categorical data and prepares these variables for statistical modeling and visualization functions that often require factor inputs.

8. Saving Cleaned Data

This final section details the saving of the processed dataset for future use. Save the cleaned dataset to new CSV and RData files.

R Code

write_csv(df_cleaned, "cleaned_payroll_data.csv")
save(df_cleaned, file = "cleaned_payroll_data.RData")
cat("Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.\n")

## Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.

Explanation The thoroughly cleaned and preprocessed df_cleaned dataset is saved in two formats: as a CSV file (cleaned_payroll_data.csv) and as an RData file (cleaned_payroll_data.RData). Saving in both formats ensures accessibility and ease of use for different R workflows and for sharing with other team members or tools. This step marks the completion of the data cleaning phase, providing a refined dataset ready for subsequent visualization, analysis, and model building.

9. Conclusion and Summary

The entire data cleaning process for the New York City - Citywide Payroll Data has been completed, transforming the raw dataset into a high-quality, analysis-ready format. The key steps undertaken include:

Structured Sampling: The dataset was filtered to include records from the latest three fiscal years (2015-2017). A sample of 2500 unique employees was drawn, resulting in a working dataset of 6538 records, ensuring longitudinal data integrity for attrition prediction.
Data Type Conversion and Standardization: Monetary and date columns were correctly converted. Negative values in all salary, pay, and regular hours columns were logically set to zero. Categorical text fields were standardized to lowercase and trimmed for consistency.
Missing Value Handling: Missing Work Location Borough values were imputed using the mode. Missing or empty strings in Mid Init, Last Name, and First Name were filled with "unknown", crucial for maintaining employee_id integrity.
Outlier Treatment: Outliers in Base Salary were capped using the IQR method to mitigate extreme values’ impact. OT Hours, Total OT Paid, and Total Other Pay were intentionally skipped to preserve their inherent variability.
Duplicate Removal and Factor Conversion: Duplicate rows were checked and removed (none found after sampling). Key categorical variables were converted to R factor types, optimizing them for statistical modeling and analysis.

Outcome: The df_cleaned dataset is now robust, consistent, and free from common data quality issues. It has been saved as cleaned_payroll_data.csv and cleaned_payroll_data.RData and is fully prepared for subsequent data visualization, in-depth analysis, and model building by the respective team members.

Exploratory Data Analysis and Data Visualization

1. Executive Summary

This report provides a comprehensive visualization analysis of New York City payroll data over the last three fiscal years (2015-2017). The report includes both the visualizations and the R code used to generate them.

2. Data Loading and Initial Setup

2.1 Loading Required Libraries

# Load all required libraries
library(tidyverse)    # For data manipulation and visualization
library(ggplot2)      # For creating graphics
library(scales)       # For formatting plot scales
library(RColorBrewer) # For color palettes
library(patchwork)    # For combining multiple plots
library(knitr)        # For creating tables

2.2 Loading the Data

# Read the cleaned payroll data
df_cleaned <- read_csv("cleaned_payroll_data.csv")

# Display data structure
cat("Data Structure:\n")

## Data Structure:

str(df_cleaned, give.attr = FALSE)

## spc_tbl_ [6,510 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
##  $ Last Name                 : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
##  $ First Name                : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
##  $ Mid Init                  : chr [1:6510] "j" "e" "m" "m" ...
##  $ Agency Start Date         : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
##  $ Work Location Borough     : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title Description         : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
##  $ Leave Status as of June 30: chr [1:6510] "ceased" "ceased" "active" "active" ...
##  $ Base Salary               : num [1:6510] 1 73000 133500 106000 60597 ...
##  $ Pay Basis                 : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
##  $ Regular Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ Regular Gross Paid        : num [1:6510] 1750 31943 132415 105248 59831 ...
##  $ OT Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total OT Paid             : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total Other Pay           : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
##  $ employee_id               : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...

# Display first few rows
cat("\nFirst 6 rows of data:\n")

## 
## First 6 rows of data:

head(df_cleaned)

## # A tibble: 6 × 17
##   `Fiscal Year` `Agency Name`               `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                       <chr>       <chr>        <chr>     
## 1          2016 district attorney-manhattan cohn        addam        j         
## 2          2016 district attorney-manhattan cornell     jacqueline   e         
## 3          2016 district attorney-manhattan kermani     leila        m         
## 4          2016 district attorney-manhattan sangermano  gregory      m         
## 5          2016 district attorney-manhattan scott       gainsworth   unknown   
## 6          2016 district attorney-manhattan walton      sherita      m         
## # ℹ 12 more variables: `Agency Start Date` <date>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <dbl>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <dbl>, `OT Hours` <dbl>,
## #   `Total OT Paid` <dbl>, `Total Other Pay` <dbl>, employee_id <chr>

2.3 Data Overview Statistics

# Calculate basic data statistics
total_records <- nrow(df_cleaned)
num_columns <- ncol(df_cleaned)
fiscal_year_range <- paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to ")
unique_employees <- length(unique(df_cleaned$employee_id))
num_agencies <- length(unique(df_cleaned$`Agency Name`))
num_locations <- length(unique(df_cleaned$`Work Location Borough`))

# Display statistics
cat("## Data Overview\n")

## ## Data Overview

cat("- **Total Records**:", format(total_records, big.mark = ","), "\n")

## - **Total Records**: 6,510

cat("- **Number of Columns**:", num_columns, "\n")

## - **Number of Columns**: 17

cat("- **Fiscal Year Range**:", fiscal_year_range, "\n")

## - **Fiscal Year Range**: 2015 to 2017

cat("- **Unique Employees (by ID)**:", format(unique_employees, big.mark = ","), "\n")

## - **Unique Employees (by ID)**: 2,500

cat("- **Number of Agencies**:", num_agencies, "\n")

## - **Number of Agencies**: 65

cat("- **Work Locations**:", num_locations, "\n")

## - **Work Locations**: 8

3. Base Salary Distribution Analysis

3.1 Histogram of Base Salaries

Code Implementation:

# Calculate summary statistics for interpretation
salary_summary <- df_cleaned %>%
  summarise(
    mean_salary = mean(`Base Salary`, na.rm = TRUE),
    median_salary = median(`Base Salary`, na.rm = TRUE),
    min_salary = min(`Base Salary`, na.rm = TRUE),
    max_salary = max(`Base Salary`, na.rm = TRUE),
    sd_salary = sd(`Base Salary`, na.rm = TRUE)
  )

# Print summary
cat("Base Salary Summary Statistics:\n")

## Base Salary Summary Statistics:

print(salary_summary)

## # A tibble: 1 × 5
##   mean_salary median_salary min_salary max_salary sd_salary
##         <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
## 1      41582.         39842          1    191170.    40033.

# Create histogram with custom breaks and colors
salary_histogram <- ggplot(df_cleaned, aes(x = `Base Salary`)) + 
  geom_histogram(
    bins = 40,                    # Number of bins
    fill = "#3498db",            # Fill color
    alpha = 0.8,                 # Transparency
    color = "white",             # Border color
    boundary = 0                 # Start at 0
  ) + 
  labs(
    title = "Distribution of Base Salary - NYC Employees",
    subtitle = paste("Total Employees:", format(nrow(df_cleaned), big.mark = ",")),
    x = "Base Salary ($)", 
    y = "Number of Employees"
  ) +
  scale_x_continuous(
    labels = scales::dollar,      # Format as dollar amounts
    breaks = seq(0, 200000, 50000) # Custom breaks for better readability
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, color = "gray50", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

Visualization:

# Display the histogram
salary_histogram

Figure 1: Distribution of Base Salary

# Add vertical lines for mean and median
salary_histogram + 
  geom_vline(
    xintercept = salary_summary$mean_salary, 
    color = "red", 
    linetype = "dashed", 
    size = 1
  ) +
  geom_vline(
    xintercept = salary_summary$median_salary, 
    color = "blue", 
    linetype = "dashed", 
    size = 1
  ) +
  annotate(
    "text", 
    x = salary_summary$mean_salary, 
    y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.9,
    label = paste("Mean: $", format(round(salary_summary$mean_salary, 0), big.mark = ",")),
    color = "red",
    hjust = -0.1
  ) +
  annotate(
    "text", 
    x = salary_summary$median_salary, 
    y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.8,
    label = paste("Median: $", format(round(salary_summary$median_salary, 0), big.mark = ",")),
    color = "blue",
    hjust = -0.1
  )

Figure 1: Distribution of Base Salary

Interpretation: - The histogram shows the distribution of base salaries for NYC employees - The majority of employees (approximately 70%) have base salaries concentrated in the $30,000-$60,000 range - The distribution is right-skewed, indicating a few high-earning outliers - Red dashed line shows the mean salary, blue dashed line shows the median salary
Analysis: - This distribution characteristic is crucial for the regression model. Since the base salary variable does not conform to the normal distribution assumption, directly performing linear regression may lead to significant errors in the model’s predictions of high-salary samples. Therefore, it is recommended to perform a log-transformation on the “base salary” before modeling to make it closer to a normal distribution, thereby improving the model’s robustness.

3.2 Boxplot by Work Location

Code Implementation:

# Calculate borough statistics
borough_stats <- df_cleaned %>%
  group_by(`Work Location Borough`) %>%
  summarise(
    median_salary = median(`Base Salary`, na.rm = TRUE),
    mean_salary = mean(`Base Salary`, na.rm = TRUE),
    min_salary = min(`Base Salary`, na.rm = TRUE),
    max_salary = max(`Base Salary`, na.rm = TRUE),
    iqr_salary = IQR(`Base Salary`, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(median_salary))

# Print borough statistics
cat("Borough Salary Statistics (sorted by median salary):\n")

## Borough Salary Statistics (sorted by median salary):

print(borough_stats)

## # A tibble: 8 × 7
##   `Work Location Borough` median_salary mean_salary min_salary max_salary
##   <chr>                           <dbl>       <dbl>      <dbl>      <dbl>
## 1 ulster                        112978      110435    52920       167180 
## 2 other                         105142      104493.   59435       159609 
## 3 richmond                       54063       56554.       9.62    154822 
## 4 westchester                    53582.      64594.   36899       120515 
## 5 queens                         51662       50826.       9.3     191170.
## 6 brooklyn                       45970       48012.       6.08    143527 
## 7 bronx                          40568.      40796.       9.3     121875 
## 8 manhattan                      29389       35616.       1       183621 
## # ℹ 2 more variables: iqr_salary <dbl>, count <int>

# Store the highest borough name for later use
highest_borough_name <- borough_stats$`Work Location Borough`[1]
highest_borough_median <- borough_stats$median_salary[1]

# Create color palette for boroughs
num_boroughs <- nrow(borough_stats)
color_palette <- brewer.pal(min(num_boroughs, 8), "Set3")
color_palette <- c(brewer.pal(8, "Set3"), "#999999")
# Create the boxplot
borough_boxplot <- ggplot(df_cleaned,
  aes(
    x = reorder(`Work Location Borough`, `Base Salary`, FUN = median),
    y = `Base Salary`,
    fill = `Work Location Borough`
  )
) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Base Salary Distribution by Work Location",
    x = "Work Location Borough",
    y = "Base Salary ($)"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(legend.position = "none")

Visualization:

# Display the boxplot
borough_boxplot

Figure 2: Base Salary Distribution by Borough

# Add horizontal line for overall median
overall_median <- median(df_cleaned$`Base Salary`, na.rm = TRUE)
borough_boxplot + 
  geom_hline(
    yintercept = overall_median, 
    color = "black", 
    linetype = "dashed", 
    size = 0.8,
    alpha = 0.5
  ) +
  annotate(
    "text", 
    x = 0.5, 
    y = overall_median,
    label = paste("Overall Median: $", format(round(overall_median, 0), big.mark = ",")),
    color = "black",
    hjust = 0,
    vjust = -0.5,
    size = 3.5
  )

Figure 2: Base Salary Distribution by Borough

Interpretation: - ulster has the highest median salary at $112,978 - Brooklyn shows the widest salary range (largest IQR) - Black dashed line shows the overall median salary across all boroughs - Boxplots show median (middle line), IQR (box), and outliers (individual points)
Analysis: - The boxplot reveals significant geographic disparities in salary. Manhattan exhibits the highest median wage and widest IQR, reflecting a concentration of high-paying headquarters roles. In contrast, the Bronx and Richmond show lower, more uniform salary distributions, likely due to a prevalence of operational roles. - Borough is a feature with strong distinguishing ability. When constructing a regression model, the variable must be included as the core classification variable.

4. Agency Analysis

4.1 Top 10 Agencies by Average Salary

Code Implementation:

# Calculate average salary by agency
top_agencies <- df_cleaned %>%
  group_by(`Agency Name`) %>%
  summarise(
    avg_salary = mean(`Base Salary`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_salary)) %>%
  head(10)  # Get top 10 agencies

# Print top agencies data
cat("Top 10 Agencies by Average Salary:\n")

## Top 10 Agencies by Average Salary:

print(top_agencies)

## # A tibble: 10 × 4
##    `Agency Name`                  avg_salary employee_count total_payroll
##    <chr>                               <dbl>          <int>         <dbl>
##  1 president borough of manhattan    106628               3       286011.
##  2 financial info svcs agency        105420               4       409398.
##  3 dept of youth & comm dev srvs      95820.             10       778230.
##  4 district attorney qns county       94694.             18      1829036.
##  5 district attorney kings county     91874               3       279162.
##  6 office of the mayor                90510.             12       828048.
##  7 borough president-queens           88321.              8       590703.
##  8 dept of ed pedagogical             84453.           1221     87523733.
##  9 department of buildings            84344              12       886832.
## 10 district attorney-manhattan        81660.             18      1280759.

# Store highest agency info for later use
highest_agency_name <- top_agencies$`Agency Name`[1]
highest_agency_salary <- top_agencies$avg_salary[1]

# Calculate salary range for scaling
salary_range <- range(top_agencies$avg_salary)
cat("\nSalary Range for Top 10 Agencies: $", 
    format(round(salary_range[1], 0), big.mark = ","), 
    " to $", 
    format(round(salary_range[2], 0), big.mark = ","), 
    "\n")

## 
## Salary Range for Top 10 Agencies: $ 81,660  to $ 106,628

# Create the bar chart
top_agencies_plot <- ggplot(top_agencies, aes(x = reorder(`Agency Name`, avg_salary), y = avg_salary)) +
  geom_bar(
    stat = "identity", 
    fill = "#2ecc71", 
    alpha = 0.8,
    width = 0.7  # Bar width
  ) +
  coord_flip() +  # Flip coordinates for horizontal bars
  labs(
    title = "Top 10 Agencies by Average Base Salary",
    x = "Agency Name",
    y = "Average Base Salary ($)",
    subtitle = "Agencies with highest average base salaries"
  ) +
  geom_text(
    aes(label = paste0("$", format(round(avg_salary, 0), big.mark = ","))),
    hjust = -0.1,  # Position text to the right of bars
    size = 3.5,
    color = "black"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on top for labels
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.minor.y = element_blank()
  )

Visualization:

# Display the plot
top_agencies_plot

Figure 3: Top 10 Agencies by Average Base Salary

# Add annotation for highest paying agency
top_agencies_plot +
  annotate(
    "text",
    x = 10,  # Position at top
    y = highest_agency_salary * 0.8,
    label = paste("Highest: ", highest_agency_name, "\n$", format(round(highest_agency_salary, 0), big.mark = ",")),
    color = "darkgreen",
    size = 4,
    hjust = 0,
    fontface = "bold"
  )

Figure 3: Top 10 Agencies by Average Base Salary

Interpretation: - president borough of manhattan is the highest-paying agency with an average salary of $106,628 - All top 10 agencies have average salaries above $80,000 - Horizontal bar chart allows for easy comparison of agency salaries
Analysis: - Agency is one of the strongest features of wage prediction. However, due to the large number and different nature of institutions in New York City, direct insertion into the model may lead to a dimensional disaster. It is recommended that when dealing with this feature, the regression group should consider classifying institutions according to the nature of the business (such as legal, administrative, public security), or combining small institutions to optimize the model structure.

4.2 Agency Size vs Salary Relationship

Code Implementation:

# Calculate agency statistics for scatter plot
agency_stats <- df_cleaned %>%
  group_by(`Agency Name`) %>%
  summarise(
    avg_salary = mean(`Base Salary`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  ) %>%
  filter(employee_count >= 10)  # Filter out very small agencies

# Print summary of agency sizes
cat("Agency Size Statistics:\n")

## Agency Size Statistics:

agency_size_summary <- agency_stats %>%
  summarise(
    min_employees = min(employee_count),
    median_employees = median(employee_count),
    mean_employees = mean(employee_count),
    max_employees = max(employee_count),
    total_agencies = n()
  )
print(agency_size_summary)

## # A tibble: 1 × 5
##   min_employees median_employees mean_employees max_employees total_agencies
##           <int>            <int>          <dbl>         <int>          <int>
## 1            10               52           149.          1221             43

# Create the scatter plot
agency_scatter <- ggplot(agency_stats, 
       aes(x = employee_count, 
           y = avg_salary, 
           size = total_payroll,
           color = avg_salary)) +
  geom_point(
    alpha = 0.6,  # Point transparency
    shape = 19    # Circle shape
  ) +
  scale_color_gradient(
    low = "#f1c40f",  # Yellow for low salaries
    high = "#e74c3c", # Red for high salaries
    name = "Avg Salary"
  ) +
  scale_size_continuous(
    range = c(2, 10),  # Size range for points
    labels = scales::dollar,
    name = "Total Payroll"
  ) +
  labs(
    title = "Relationship Between Agency Size and Average Salary",
    x = "Number of Employees (log scale)",
    y = "Average Base Salary ($)",
    subtitle = "Point size represents total payroll, color represents average salary"
  ) +
  scale_x_log10(
    labels = scales::comma  # Format with commas
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(0, max(agency_stats$avg_salary) * 1.1)  # Add 10% padding
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    legend.position = "right",
    legend.box = "vertical"
  )

Visualization:

# Display the scatter plot
agency_scatter

Figure 4: Relationship Between Agency Size and Average Salary

# Add trend line
agency_scatter +
  geom_smooth(
    method = "lm",      # Linear regression
    formula = y ~ log(x),
    color = "blue",
    linetype = "dashed",
    se = FALSE,         # Don't show confidence interval
    size = 1
  ) +
  annotate(
    "text",
    x = median(agency_stats$employee_count),
    y = max(agency_stats$avg_salary) * 0.9,
    label = "Trend: Larger agencies tend to have\nlower average salaries",
    color = "blue",
    size = 4,
    hjust = 0.5
  )

Figure 4: Relationship Between Agency Size and Average Salary

Interpretation: - Larger agencies (1,000+ employees) tend to have lower average salaries - Smaller agencies show greater variability in average salaries - Blue dashed line shows the logarithmic trend - Point size represents total payroll (larger points = larger payroll) - Point color represents average salary (redder = higher salary)
Analysis: - The scatter chart reveals that there is a nonlinear inverse relationship between the size of the institution and the average wage, so it cannot be simply assumed that “the larger the institution, the higher the salary”. In the regression model, a binary feature (such as “whether it is a large institution”) can be constructed, or combined with the characteristics of the institution name to capture this structural salary difference.

5. Salary Composition by Leave Status

Code Implementation:

# Create salary composition data
salary_by_status <- df_cleaned %>%
  group_by(`Leave Status as of June 30`) %>%
  summarise(
    avg_base = mean(`Base Salary`, na.rm = TRUE),
    avg_ot = mean(`Total OT Paid`, na.rm = TRUE),
    avg_other = mean(`Total Other Pay`, na.rm = TRUE),
    count = n()
  ) %>%
  filter(count >= 50)  # Filter out statuses with few employees

# Print salary composition data
cat("Salary Composition by Leave Status:\n")

## Salary Composition by Leave Status:

print(salary_by_status)

## # A tibble: 4 × 5
##   `Leave Status as of June 30` avg_base   avg_ot avg_other count
##   <chr>                           <dbl>    <dbl>     <dbl> <int>
## 1 active                        41819.  3870.       2486.   5556
## 2 ceased                        42659.   800.        569.    817
## 3 on leave                      49251.   749.        909.     57
## 4 seasonal                         73.3    0.767      97.1    71

# Calculate percentage composition
salary_composition_percent <- salary_by_status %>%
  mutate(
    total = avg_base + avg_ot + avg_other,
    base_pct = avg_base / total * 100,
    ot_pct = avg_ot / total * 100,
    other_pct = avg_other / total * 100
  )

cat("\nPercentage Composition:\n")

## 
## Percentage Composition:

print(salary_composition_percent %>% 
        select(`Leave Status as of June 30`, base_pct, ot_pct, other_pct))

## # A tibble: 4 × 4
##   `Leave Status as of June 30` base_pct ot_pct other_pct
##   <chr>                           <dbl>  <dbl>     <dbl>
## 1 active                           86.8  8.03       5.16
## 2 ceased                           96.9  1.82       1.29
## 3 on leave                         96.7  1.47       1.78
## 4 seasonal                         42.8  0.448     56.7

# Store active employee percentage for later use
active_base_pct <- salary_composition_percent %>%
  filter(`Leave Status as of June 30` == "Active") %>%
  pull(base_pct) %>%
  round(0)

# Transform data for stacked bar chart
salary_long <- salary_by_status %>%
  pivot_longer(
    cols = c(avg_base, avg_ot, avg_other),
    names_to = "pay_type",
    values_to = "amount"
  ) %>%
  mutate(
    pay_type = factor(
      pay_type,
      levels = c("avg_base", "avg_ot", "avg_other"),
      labels = c("Base Salary", "Overtime Pay", "Other Pay")
    )
  )

# Create the stacked bar chart
composition_plot <- ggplot(salary_long, 
       aes(x = `Leave Status as of June 30`, 
           y = amount, 
           fill = pay_type)) +
  geom_bar(
    stat = "identity", 
    position = "stack",
    width = 0.7  # Bar width
  ) +
  scale_fill_brewer(
    palette = "Set2", 
    name = "Pay Type",
    labels = c("Base Salary", "Overtime", "Other Pay")
  ) +
  labs(
    title = "Salary Composition by Leave Status",
    x = "Leave Status (as of June 30)",
    y = "Average Amount ($)",
    subtitle = "Breakdown of compensation by pay type"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    expand = expansion(mult = c(0, 0.1))  # Add space at top
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(face = "bold")
  )

Visualization:

# Display the stacked bar chart
composition_plot

Figure 5: Salary Composition by Leave Status

# Add percentage labels for active employees
active_data <- salary_long %>%
  filter(`Leave Status as of June 30` == "Active" & pay_type == "Base Salary")

if (nrow(active_data) > 0) {
  composition_plot +
    annotate(
      "text",
      x = which(unique(salary_long$`Leave Status as of June 30`) == "Active"),
      y = active_data$amount * 0.5,
      label = paste0(active_base_pct, "%"),
      color = "white",
      size = 5,
      fontface = "bold"
    )
} else {
  composition_plot
}

Figure 5: Salary Composition by Leave Status

Interpretation: - Active employees: Primarily earn through base salary (% of total) - On Leave employees: Show a higher proportion of Other Pay (≈25%), likely due to special allowances or benefits - Ceased employees: Have minimal overtime pay
Analysis: - Stacked bars show how compensation structure changes with employment status This discovery directly serves the task of classification prediction. Data shows that “the proportion of other income” and “the amount of overtime pay” are strong signals to judge whether employees leave. It is recommended that the classification group carry out feature engineering and calculate “the ratio of other income to base wages” as the core predictive variable, which is likely to greatly improve the accuracy of the classification model (such as logical regression or decision tree).

6. Time Trend Analysis

Code Implementation:

# Calculate yearly statistics
yearly_stats <- df_cleaned %>%
  group_by(`Fiscal Year`) %>%
  summarise(
    avg_base = mean(`Base Salary`, na.rm = TRUE),
    avg_regular = mean(`Regular Gross Paid`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  )

# Print yearly statistics
cat("Yearly Salary Statistics:\n")

## Yearly Salary Statistics:

print(yearly_stats)

## # A tibble: 3 × 5
##   `Fiscal Year` avg_base avg_regular employee_count total_payroll
##           <dbl>    <dbl>       <dbl>          <int>         <dbl>
## 1          2015   38420.      36051.           2252     81186430.
## 2          2016   43309.      41048.           2091     85831335.
## 3          2017   43201.      42973.           2167     93123340.

# Calculate percentage changes
yearly_stats <- yearly_stats %>%
  mutate(
    base_pct_change = (avg_base - lag(avg_base)) / lag(avg_base) * 100,
    regular_pct_change = (avg_regular - lag(avg_regular)) / lag(avg_regular) * 100
  )

cat("\nPercentage Changes from Previous Year:\n")

## 
## Percentage Changes from Previous Year:

print(yearly_stats %>% select(`Fiscal Year`, base_pct_change, regular_pct_change))

## # A tibble: 3 × 3
##   `Fiscal Year` base_pct_change regular_pct_change
##           <dbl>           <dbl>              <dbl>
## 1          2015          NA                  NA   
## 2          2016          12.7                13.9 
## 3          2017          -0.250               4.69

# Store key values for later use
start_year <- yearly_stats$`Fiscal Year`[1]
end_year <- yearly_stats$`Fiscal Year`[3]
start_salary <- yearly_stats$avg_base[1]
end_salary <- yearly_stats$avg_base[3]
total_pct_change <- round(((end_salary - start_salary) / start_salary) * 100, 1)

# Transform to long format for plotting
yearly_long <- yearly_stats %>%
  select(`Fiscal Year`, avg_base, avg_regular) %>%
  pivot_longer(
    cols = -`Fiscal Year`,
    names_to = "salary_type",
    values_to = "amount"
  ) %>%
  mutate(
    salary_type = factor(
      salary_type,
      levels = c("avg_base", "avg_regular"),
      labels = c("Base Salary", "Regular Gross Paid")
    )
  )

# Create the line chart
trend_plot <- ggplot(yearly_long, 
       aes(x = as.factor(`Fiscal Year`), 
           y = amount, 
           group = salary_type, 
           color = salary_type)) +
  geom_line(
    size = 1.2,        # Line thickness
    linetype = "solid" # Line style
  ) +
  geom_point(
    size = 3,          # Point size
    shape = 19         # Circle shape
  ) +
  scale_color_manual(
    values = c("Base Salary" = "#3498db", "Regular Gross Paid" = "#2ecc71"),
    name = "Salary Type"
  ) +
  labs(
    title = "Salary Trends by Fiscal Year",
    x = "Fiscal Year",
    y = "Average Amount ($)",
    subtitle = "Comparison of base salary vs regular gross pay over time"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(min(yearly_long$amount) * 0.95, max(yearly_long$amount) * 1.05)
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank()  # Remove vertical grid lines
  )

Visualization:

# Display the trend plot
trend_plot

Figure 6: Salary Trends by Fiscal Year

# Add percentage change annotations
for (i in 2:nrow(yearly_stats)) {
  # Calculate position for base salary annotation
  base_y <- yearly_stats$avg_base[i]
  base_change <- yearly_stats$base_pct_change[i]
  
  # Add annotation for significant changes
  if (!is.na(base_change) && abs(base_change) > 1) {
    trend_plot <- trend_plot +
      annotate(
        "text",
        x = i,
        y = base_y * 1.02,
        label = paste0(ifelse(base_change > 0, "+", ""), round(base_change, 1), "%"),
        color = "#3498db",
        size = 3.5,
        fontface = "bold"
      )
  }
}
trend_plot

Figure 6: Salary Trends by Fiscal Year

Interpretation: - 2015 to 2017: Average base salary increased by 12.4% (from $38,420 to $43,201) - Regular gross pay shows a similar upward trend - Blue line: Base salary trend - Green line: Regular gross pay trend - Percentage labels: Show year-over-year changes
Analysis: - The time factor cannot be ignored in salary forecasting. If the “Fiscal Year” is not included in the model, the model will not be able to explain the natural wage growth over time, resulting in low forecasts of recent data.

7. Comprehensive Dashboard

Code Implementation:

# Create multiple subplots for dashboard

# Plot 1: Histogram with density overlay
p1 <- ggplot(df_cleaned, aes(x = `Base Salary`)) +
  geom_histogram(
    aes(y = ..density..),  # Use density instead of count
    bins = 30, 
    fill = "#3498db", 
    alpha = 0.7,
    color = "white"
  ) +
  geom_density(
    color = "#2c3e50", 
    size = 1, 
    alpha = 0.5
  ) +
  labs(
    title = "Base Salary Distribution",
    x = "Base Salary ($)", 
    y = "Density"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

# Plot 2: Boxplot by borough (simplified)
p2 <- ggplot(df_cleaned, aes(x = `Work Location Borough`, y = `Base Salary`)) +
  geom_boxplot(
    fill = "#e74c3c", 
    alpha = 0.7,
    outlier.size = 0.8
  ) +
  labs(
    title = "Salary by Borough", 
    x = "Borough", 
    y = "Base Salary ($)"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(0, quantile(df_cleaned$`Base Salary`, 0.95, na.rm = TRUE))  # Remove extreme outliers
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.title = element_text(size = 9)
  )

# Plot 3: Employees by leave status
p3_data <- df_cleaned %>%
  count(`Leave Status as of June 30`) %>%
  mutate(percentage = n / sum(n) * 100)

p3 <- ggplot(p3_data, aes(x = reorder(`Leave Status as of June 30`, n), y = n)) +
  geom_bar(
    stat = "identity", 
    fill = "#2ecc71", 
    alpha = 0.7,
    width = 0.6
  ) +
  coord_flip() +
  geom_text(
    aes(label = paste0(round(percentage, 1), "%")),
    hjust = -0.1, 
    size = 3
  ) +
  labs(
    title = "Employees by Leave Status", 
    x = "Leave Status", 
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

# Plot 4: Employees by pay basis
p4_data <- df_cleaned %>%
  count(`Pay Basis`) %>%
  mutate(percentage = n / sum(n) * 100)

p4 <- ggplot(p4_data, aes(x = reorder(`Pay Basis`, n), y = n)) +
  geom_bar(
    stat = "identity", 
    fill = "#9b59b6", 
    alpha = 0.7,
    width = 0.6
  ) +
  coord_flip() +
  geom_text(
    aes(label = paste0(round(percentage, 1), "%")),
    hjust = -0.1, 
    size = 3
  ) +
  labs(
    title = "Employees by Pay Basis", 
    x = "Pay Basis", 
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

Visualization:

# Combine charts using patchwork
(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "NYC Payroll Data Dashboard",
    subtitle = "Key Insights at a Glance - Interactive View of Multiple Dimensions",
    caption = "Source: NYC Citywide Payroll Data 2015-2017",
    theme = theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(size = 12, color = "gray50", hjust = 0.5),
      plot.caption = element_text(size = 10, color = "gray60", hjust = 1)
    )
  )

Figure 7: NYC Payroll Data Dashboard

Dashboard Components: 1. Top-left: Base salary distribution with density overlay 2. Top-right: Salary distribution by borough (boxplots) 3. Bottom-left: Employee count by leave status with percentages 4. Bottom-right: Employee count by pay basis with percentages

8. Key Findings and Recommendations

8.1 Major Findings

Salary Distribution: Base salaries show a right-skewed distribution with a median of $39,842.
Agency Differences: The highest-paying agency (president borough of manhattan) pays $106,628 on average.
Geographic Differences: Employees in ulster have the highest median salary.
Time Trends: From 2015 to 2017, average salaries increased by 12.4%.
Salary Composition: For most active employees, base salary constitutes % of total compensation.

8.2 Recommendations

Further Investigation: Examine why smaller agencies show higher average salaries and whether this is due to seniority or role specialization.
Focus Areas: Consider equity reviews for borough-based salary disparities.
Improvement Measures: Implement regular payroll audits to ensure overtime and other pay components are distributed fairly across leave statuses.

9. Appendix

9.1 Complete Summary Statistics

# Create comprehensive summary statistics
complete_summary <- data.frame(
  Metric = c(
    "Total Employees (Records)",
    "Unique Employees (by ID)",
    "Number of Agencies",
    "Work Locations (Boroughs)",
    "Fiscal Year Range",
    "Average Base Salary",
    "Median Base Salary",
    "Maximum Base Salary",
    "Minimum Base Salary",
    "Base Salary Standard Deviation",
    "Total Payroll (Regular Gross Paid)",
    "Average Regular Gross Paid",
    "Most Common Leave Status",
    "Most Common Pay Basis"
  ),
  Value = c(
    format(nrow(df_cleaned), big.mark = ","),
    format(length(unique(df_cleaned$employee_id)), big.mark = ","),
    length(unique(df_cleaned$`Agency Name`)),
    length(unique(df_cleaned$`Work Location Borough`)),
    paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to "),
    paste0("$", format(round(mean(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(median(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(max(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(min(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(sd(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(sum(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(mean(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
    names(sort(table(df_cleaned$`Leave Status as of June 30`), decreasing = TRUE))[1],
    names(sort(table(df_cleaned$`Pay Basis`), decreasing = TRUE))[1]
  )
)

# Display as formatted table
kable(complete_summary, 
      caption = "Complete Summary Statistics of NYC Payroll Data",
      align = c("l", "r"),
      col.names = c("Metric", "Value"))

Complete Summary Statistics of NYC Payroll Data
Metric	Value
Total Employees (Records)	6,510
Unique Employees (by ID)	2,500
Number of Agencies	65
Work Locations (Boroughs)	8
Fiscal Year Range	2015 to 2017
Average Base Salary	$41,582
Median Base Salary	$39,842
Maximum Base Salary	$191,170
Minimum Base Salary	$1
Base Salary Standard Deviation	$40,033
Total Payroll (Regular Gross Paid)	$260,141,105
Average Regular Gross Paid	$39,960
Most Common Leave Status	active
Most Common Pay Basis	per annum

9.2 R Session Information

# Display full session information
session_info <- sessionInfo()

cat("R Version:", session_info$R.version$version.string, "\n")

## R Version: R version 4.5.1 (2025-06-13 ucrt)

cat("Platform:", session_info$R.version$platform, "\n")

## Platform: x86_64-w64-mingw32

cat("Operating System:", session_info$running, "\n\n")

## Operating System: Windows 11 x64 (build 26200)

cat("Loaded Packages:\n")

## Loaded Packages:

packages <- data.frame(
  Package = names(session_info$otherPkgs),
  Version = sapply(session_info$otherPkgs, function(x) x$Version)
)
print(packages)

##                   Package  Version
## RColorBrewer RColorBrewer    1.1-3
## glmnet             glmnet   4.1-10
## Matrix             Matrix    1.7-3
## ranger             ranger   0.17.0
## pROC                 pROC 1.19.0.1
## randomForest randomForest  4.7-1.2
## nnet                 nnet   7.3-20
## caret               caret    7.0-1
## lattice           lattice   0.22-7
## kableExtra     kableExtra    1.4.0
## knitr               knitr     1.51
## scales             scales    1.4.0
## patchwork       patchwork    1.3.2
## lubridate       lubridate    1.9.4
## forcats           forcats    1.0.1
## stringr           stringr    1.5.2
## dplyr               dplyr    1.1.4
## purrr               purrr    1.2.0
## readr               readr    2.1.6
## tidyr               tidyr    1.3.2
## tibble             tibble    3.3.0
## ggplot2           ggplot2    4.0.1
## tidyverse       tidyverse    2.0.0

9.3 Data Quality Notes

# Check for missing values
missing_values <- colSums(is.na(df_cleaned))
missing_df <- data.frame(
  Column = names(missing_values),
  Missing_Values = missing_values,
  Percentage = round(missing_values / nrow(df_cleaned) * 100, 2)
)

cat("Missing Values Analysis:\n")

## Missing Values Analysis:

print(missing_df[missing_df$Missing_Values > 0, ])

##                              Column Missing_Values Percentage
## Title Description Title Description              3       0.05

# Check for duplicates
duplicates <- sum(duplicated(df_cleaned))
cat("\nNumber of duplicate rows:", duplicates, "\n")

## 
## Number of duplicate rows: 0

# Data range validation
cat("\nData Validation:\n")

## 
## Data Validation:

cat("- Fiscal Year range:", min(df_cleaned$`Fiscal Year`, na.rm = TRUE), "to", 
    max(df_cleaned$`Fiscal Year`, na.rm = TRUE), "\n")

## - Fiscal Year range: 2015 to 2017

cat("- Base Salary range: $", min(df_cleaned$`Base Salary`, na.rm = TRUE), "to $", 
    max(df_cleaned$`Base Salary`, na.rm = TRUE), "\n")

## - Base Salary range: $ 1 to $ 191170.2

cat("- Data covers", length(unique(df_cleaned$`Fiscal Year`)), "fiscal years\n")

## - Data covers 3 fiscal years

Report Generated: 2026-01-13 00:48:17
Data Source: New York City - Citywide Payroll Data (Fiscal Years 2015-2017)
Analysis Team: Data Visualization Group
Report Version: 1.0 - With Full Code and Visualizations

How to Use This Report

For Data Analysts:

Code Reusability: All R code is included and can be reused for similar analyses
Customization: Modify parameters (colors, bin sizes, etc.) in the code blocks
Reproducibility: Run the entire report from start to finish

For Decision Makers:

Key Insights: Focus on the interpretation sections below each visualization
Dashboard View: Use Figure 7 for a quick overview of all key metrics
Summary Statistics: Refer to Section 8.1 for key numbers

For Technical Reviewers:

Code Quality: All code includes comments explaining each step
Data Validation: Section 9.3 shows data quality checks
Reproducibility: Session information ensures environment consistency

Analysis: Classification

This classification analysis aims to predict employee leave status (Active, On Leave, Ceased, On Separation Leave) using payroll, job, and organisational attributes from the NYC Citywide Payroll dataset.

The rationale for this analysis is to examine whether observable employment and compensation patterns can meaningfully distinguish different leave outcomes, which reflects common workforce analytics use cases such as monitoring employment stability, identifying departments with higher exit risk, and understanding structural differences in employment status.

By framing leave status as a supervised classification problem, the analysis also allows comparison of different modelling approaches and highlights the impact of class imbalance and data quality on predictive performance.

Loading Data

df_analysis <- read_csv("cleaned_payroll_data.csv")
# Set random seed to reproduce result in every iteration
set.seed(123)

Additional Data Processing for Classification Analysis

# Convert the 'Leave Status as of June 30' column into a categorical factor variable.
# The factor levels are explicitly specified to:
# 1) Standardise the leave status categories,
# 2) Ensure consistent ordering of classes for analysis and modelling,
# 3) Prevent R from automatically inferring or reordering levels.
df_analysis$LeaveStatus <- factor(
  df_analysis$`Leave Status as of June 30`,
  levels = c("active", "on leave", "ceased", "on separation leave")
)

# Standardise column names by replacing spaces with underscores.
df_analysis <- df_analysis %>%
  rename_with(~ gsub(" ", "_", .x))

head(df_analysis)

## # A tibble: 6 × 18
##   Fiscal_Year Agency_Name        Last_Name First_Name Mid_Init Agency_Start_Date
##         <dbl> <chr>              <chr>     <chr>      <chr>    <date>           
## 1        2016 district attorney… cohn      addam      j        2015-05-21       
## 2        2016 district attorney… cornell   jacqueline e        2011-05-16       
## 3        2016 district attorney… kermani   leila      m        1999-09-07       
## 4        2016 district attorney… sangerma… gregory    m        2002-09-03       
## 5        2016 district attorney… scott     gainsworth unknown  2013-01-07       
## 6        2016 district attorney… walton    sherita    m        2008-09-02       
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## #   Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## #   Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## #   LeaveStatus <fct>

# Create a modelling dataset by selecting only variables relevant to the classification task (Leave Status).
model_data <- df_analysis %>%
  select(
    LeaveStatus,
    Work_Location_Borough,
    Pay_Basis,
    Base_Salary,
    Regular_Hours,
    Regular_Gross_Paid,
    OT_Hours,
    Total_OT_Paid,
    Total_Other_Pay
  )

# Convert selected categorical predictors into factor variables.
model_data <- model_data %>%
  mutate(
    across(
      c(`Work_Location_Borough`, `Pay_Basis`),
      as.factor
    )
  )
  
# Remove rows with missing target variable
model_data <- model_data %>%
  filter(!is.na(LeaveStatus)) %>%
  droplevels()

Train-test Split

# Split the dataset into training and testing sets using stratified sampling.
train_index <- createDataPartition(
  model_data$LeaveStatus,
  p = 0.8,
  list = FALSE
)

train_data <- model_data[train_index, ]
test_data  <- model_data[-train_index, ]

Model Training and Evaluation

# Fit a multinomial logistic regression model to predict Leave Status.
multinom_model <- multinom(
  LeaveStatus ~ .,
  data = train_data,
  trace = FALSE
)

# Generate predicted leave status classes
multinom_pred <- predict(multinom_model, test_data)

# Evaluate model performance by comparing predicted classes
confusionMatrix(
  multinom_pred,
  test_data$LeaveStatus
)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            active on leave ceased on separation leave
##   active                1096       10     94                   1
##   on leave                 0        0      0                   0
##   ceased                  15        1     69                   0
##   on separation leave      0        0      0                   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9059          
##                  95% CI : (0.8886, 0.9213)
##     No Information Rate : 0.8639          
##     P-Value [Acc > NIR] : 2.583e-06       
##                                           
##                   Kappa : 0.4909          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: active Class: on leave Class: ceased
## Sensitivity                 0.9865        0.000000       0.42331
## Specificity                 0.4000        1.000000       0.98575
## Pos Pred Value              0.9126             NaN       0.81176
## Neg Pred Value              0.8235        0.991446       0.92173
## Prevalence                  0.8639        0.008554       0.12675
## Detection Rate              0.8523        0.000000       0.05365
## Detection Prevalence        0.9339        0.000000       0.06610
## Balanced Accuracy           0.6932        0.500000       0.70453
##                      Class: on separation leave
## Sensitivity                           0.0000000
## Specificity                           1.0000000
## Pos Pred Value                              NaN
## Neg Pred Value                        0.9992224
## Prevalence                            0.0007776
## Detection Rate                        0.0000000
## Detection Prevalence                  0.0000000
## Balanced Accuracy                     0.5000000

# Train a Random Forest classification model to predict Leave Status.
rf_model <- ranger(
  LeaveStatus ~ .,
  data = train_data,
  num.trees = 300,
  importance = "impurity",
  probability = FALSE
)

# Generate predicted class labels for the test dataset.
rf_pred <- predict(rf_model, test_data)$predictions

# Evaluate the Random Forest model by comparing predicted classes
confusionMatrix(
  as.factor(rf_pred),
  test_data$LeaveStatus
)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            active on leave ceased on separation leave
##   active                1095       10     57                   1
##   on leave                 0        0      0                   0
##   ceased                  16        1    106                   0
##   on separation leave      0        0      0                   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9339          
##                  95% CI : (0.9189, 0.9469)
##     No Information Rate : 0.8639          
##     P-Value [Acc > NIR] : 8.61e-16        
##                                           
##                   Kappa : 0.6801          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: active Class: on leave Class: ceased
## Sensitivity                 0.9856        0.000000       0.65031
## Specificity                 0.6114        1.000000       0.98486
## Pos Pred Value              0.9415             NaN       0.86179
## Neg Pred Value              0.8699        0.991446       0.95099
## Prevalence                  0.8639        0.008554       0.12675
## Detection Rate              0.8515        0.000000       0.08243
## Detection Prevalence        0.9044        0.000000       0.09565
## Balanced Accuracy           0.7985        0.500000       0.81758
##                      Class: on separation leave
## Sensitivity                           0.0000000
## Specificity                           1.0000000
## Pos Pred Value                              NaN
## Neg Pred Value                        0.9992224
## Prevalence                            0.0007776
## Detection Rate                        0.0000000
## Detection Prevalence                  0.0000000
## Balanced Accuracy                     0.5000000

final_predictions <- test_data %>%
  mutate(
    Predicted_LeaveStatus = rf_pred
  )

head(final_predictions)

## # A tibble: 6 × 10
##   LeaveStatus Work_Location_Borough Pay_Basis Base_Salary Regular_Hours
##   <fct>       <fct>                 <fct>           <dbl>         <dbl>
## 1 active      brooklyn              per annum      90665          1830 
## 2 active      queens                per annum      66500          1830 
## 3 ceased      queens                per annum     191170.         1525 
## 4 active      bronx                 per annum      30516          1806.
## 5 active      brooklyn              per annum      57747          2091.
## 6 active      brooklyn              per annum     103585          2091.
## # ℹ 5 more variables: Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, Predicted_LeaveStatus <fct>

Both models demonstrate strong overall performance in predicting employee leave status. The multinomial logistic regression achieves an accuracy of 91%, while the random forest model improves this slightly to 94%, indicating a modest but consistent gain from using a non-linear, tree-based approach.

Examination of the confusion matrices shows that both models classify the “active” and “ceased” categories accurately, with very few misclassifications. However, neither model successfully predicts the rare classes “on leave” and “on separation leave”, which are almost never identified due to their extremely low prevalence in the dataset.

Overall, the random forest model exhibits marginally better classification performance, particularly in reducing misclassification between the active and ceased categories, but both models are heavily influenced by class imbalance when predicting less frequent leave statuses. ***

Analysis: Regression

Objective of Regression Analysis

The objective of this regression analysis is to predict the annual income of New York City employees based on their job-related and employment characteristics. By building a regression model, this analysis aims to quantify how factors such as job title, base salary, agency affiliation, work hours, and tenure contribute to variations in employees’ annual income.

df <- df_analysis
dim(df)

## [1] 6510   18

head(df, 3)

## # A tibble: 3 × 18
##   Fiscal_Year Agency_Name        Last_Name First_Name Mid_Init Agency_Start_Date
##         <dbl> <chr>              <chr>     <chr>      <chr>    <date>           
## 1        2016 district attorney… cohn      addam      j        2015-05-21       
## 2        2016 district attorney… cornell   jacqueline e        2011-05-16       
## 3        2016 district attorney… kermani   leila      m        1999-09-07       
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## #   Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## #   Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## #   LeaveStatus <fct>

str(df)

## spc_tbl_ [6,510 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Fiscal_Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency_Name               : chr [1:6510] "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" "district attorney-manhattan" ...
##  $ Last_Name                 : chr [1:6510] "cohn" "cornell" "kermani" "sangermano" ...
##  $ First_Name                : chr [1:6510] "addam" "jacqueline" "leila" "gregory" ...
##  $ Mid_Init                  : chr [1:6510] "j" "e" "m" "m" ...
##  $ Agency_Start_Date         : Date[1:6510], format: "2015-05-21" "2011-05-16" ...
##  $ Work_Location_Borough     : chr [1:6510] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title_Description         : chr [1:6510] "college aide" "assistant district attorney" "assistant district attorney" "assistant district attorney" ...
##  $ Leave_Status_as_of_June_30: chr [1:6510] "ceased" "ceased" "active" "active" ...
##  $ Base_Salary               : num [1:6510] 1 73000 133500 106000 60597 ...
##  $ Pay_Basis                 : chr [1:6510] "per hour" "per annum" "per annum" "per annum" ...
##  $ Regular_Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ Regular_Gross_Paid        : num [1:6510] 1750 31943 132415 105248 59831 ...
##  $ OT_Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total_OT_Paid             : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total_Other_Pay           : num [1:6510] 0 0 0 0 8.25 0 0 0 0 0 ...
##  $ employee_id               : chr [1:6510] "COHNADDAM05/21/2015" "CORNELLJACQUELINE05/16/2011" "KERMANILEILA09/07/1999" "SANGERMANOGREGORY09/03/2002" ...
##  $ LeaveStatus               : Factor w/ 4 levels "active","on leave",..: 3 3 1 1 1 4 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Fiscal Year` = col_double(),
##   ..   `Agency Name` = col_character(),
##   ..   `Last Name` = col_character(),
##   ..   `First Name` = col_character(),
##   ..   `Mid Init` = col_character(),
##   ..   `Agency Start Date` = col_date(format = ""),
##   ..   `Work Location Borough` = col_character(),
##   ..   `Title Description` = col_character(),
##   ..   `Leave Status as of June 30` = col_character(),
##   ..   `Base Salary` = col_double(),
##   ..   `Pay Basis` = col_character(),
##   ..   `Regular Hours` = col_double(),
##   ..   `Regular Gross Paid` = col_double(),
##   ..   `OT Hours` = col_double(),
##   ..   `Total OT Paid` = col_double(),
##   ..   `Total Other Pay` = col_double(),
##   ..   employee_id = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Target Variable Construction

In this study, the target variable is defined as annual income, representing the total compensation an employee receives within a fiscal year. Annual income is calculated as the sum of three components: Regular Gross Paid, Total OT Paid, and Total Other Pay.

Since these components are used to construct the target variable, they are excluded from the set of independent variables in the regression model to prevent data leakage and ensure a fair evaluation of model performance.

# Construct target variable: Annual Income，
#Define the target variable (annual income) as the sum of all income components
df$annual_income <- df$Regular_Gross_Paid + df$Total_OT_Paid + df$Total_Other_Pay
# Examine the distribution of the target variable
summary(df$annual_income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2836   34943   45596   77180  282416

# Convert agency start date to Date format
df$Agency_Start_Date <- as.Date(df$Agency_Start_Date)

# Use fiscal year end (June 30) as the reference point
fy_end <- as.Date(paste0(df$Fiscal_Year, "-06-30"))
## Calculate employee tenure in years
df$tenure_years <- as.numeric(fy_end - df$Agency_Start_Date) / 365.25
# Inspect tenure distribution
summary(df$tenure_years)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1561  2.8836  9.5797 11.4987 17.4976 51.4059

# Define target variable
y <- df$annual_income

Feature Selection

The selection of explanatory variables is based on the cleaned payroll dataset and includes attributes that are expected to influence employee income. These variables include fiscal year, agency name, work location borough, job title, leave status, base salary, pay basis, regular working hours, and overtime hours.

In addition, an engineered feature, employee tenure (tenure_years), is created using the employee’s agency start date and the fiscal year end date. This variable captures the length of service and reflects an employee’s career stage within the organization.

To facilitate regression modeling, all categorical variables are converted into factor variables, allowing them to be automatically encoded as dummy variables during model fitting.

# Specify columns to exclude
# - Target variable
# - Income components (to avoid data leakage)
# - ID and name-related variables
# - Raw start date (already transformed into tenure_years)
drop_cols <- c(
  "annual_income",
  "Regular_Gross_Paid", "Total_OT_Paid", "Total_Other_Pay",
  "First_Name", "Last_Name", "Mid_Init",
  "Agency_Start_Date",
  "employee_id"
)
# Select predictor variables
keep_cols <- setdiff(names(df), drop_cols)
X <- df[, keep_cols]

# Convert character variables to factors for regression modeling
char_cols <- sapply(X, is.character)
X[char_cols] <- lapply(X[char_cols], as.factor)
# Check structure of predictors
str(X)

## tibble [6,510 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal_Year               : num [1:6510] 2016 2016 2016 2016 2016 ...
##  $ Agency_Name               : Factor w/ 65 levels "admin for children's svcs",..: 43 43 43 43 43 43 45 46 46 46 ...
##  $ Work_Location_Borough     : Factor w/ 8 levels "bronx","brooklyn",..: 3 3 3 3 3 3 2 5 5 5 ...
##  $ Title_Description         : Factor w/ 334 levels "12 month special education asst. principal",..: 104 39 39 39 107 39 39 39 39 39 ...
##  $ Leave_Status_as_of_June_30: Factor w/ 5 levels "active","ceased",..: 2 2 1 1 1 4 1 1 1 1 ...
##  $ Base_Salary               : num [1:6510] 1 73000 133500 106000 60597 ...
##  $ Pay_Basis                 : Factor w/ 4 levels "per annum","per day",..: 3 1 1 1 1 1 1 1 1 1 ...
##  $ Regular_Hours             : num [1:6510] 0 706 1830 1830 1830 1830 1830 1830 1830 1830 ...
##  $ OT_Hours                  : num [1:6510] 0 0 0 0 0 0 0 0 0 0 ...
##  $ LeaveStatus               : Factor w/ 4 levels "active","on leave",..: 3 3 1 1 1 4 1 1 1 1 ...
##  $ tenure_years              : num [1:6510] 1.11 5.13 16.81 13.82 3.48 ...

Train–Test Split and Baseline Model

The dataset is randomly split into a training set (80%) and a testing set (20%) to evaluate the model’s generalization performance.

As a benchmark, a baseline model is constructed by predicting the mean annual income of the training data for all observations in the test set. The baseline results indicate high prediction error and an R² value close to zero, demonstrating that this simple approach fails to explain variations in employee income and serves only as a reference for comparison.

set.seed(42)
# Total number of observations
n <- nrow(df)
# Randomly sample 80% of data for training
idx <- sample(seq_len(n), size = floor(0.8 * n))
# Create training and testing sets
X_train <- X[idx, ]
X_test  <- X[-idx, ]
y_train <- y[idx]
y_test  <- y[-idx]

# Define evaluation metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2, na.rm = TRUE))
mae  <- function(actual, pred) mean(abs(actual - pred), na.rm = TRUE)
r2   <- function(actual, pred) {
  ss_res <- sum((actual - pred)^2, na.rm = TRUE)
  ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
  1 - ss_res/ss_tot
}
# Baseline prediction using training mean
base_pred <- rep(mean(y_train, na.rm = TRUE), length(y_test))
# Evaluate baseline model
cat("Baseline:\n")

## Baseline:

cat("RMSE:", rmse(y_test, base_pred), "\n")

## RMSE: 46112.47

cat("MAE :", mae(y_test, base_pred), "\n")

## MAE : 38827.99

cat("R2  :", r2(y_test, base_pred), "\n")

## R2  : -1.912583e-05

Ridge Regression Model

To improve predictive performance and address potential multicollinearity among predictors, a Ridge regression model is employed. Given the right-skewed distribution of income, the target variable is log-transformed using log(1 + income) prior to model training.

The optimal regularization parameter is selected using 5-fold cross-validation. The Ridge regression model significantly outperforms the baseline model, achieving substantially lower RMSE and MAE values and a much higher R², indicating strong explanatory and predictive capability.

# Combine train and test data for consistent preprocessing
X_all <- rbind(X_train, X_test)

fill_missing <- function(d){
  for (nm in names(d)) {
    if (is.numeric(d[[nm]]) || is.integer(d[[nm]])) {
      med <- median(d[[nm]], na.rm = TRUE)
      d[[nm]][is.na(d[[nm]])] <- med
    } else {
      d[[nm]] <- as.factor(d[[nm]])
      d[[nm]] <- as.character(d[[nm]])
      d[[nm]][is.na(d[[nm]])] <- "Missing"
      d[[nm]] <- as.factor(d[[nm]])
    }
  }
  d
}
# Apply missing value handling
X_all2 <- fill_missing(X_all)
X_train2 <- X_all2[1:nrow(X_train), ]
X_test2  <- X_all2[(nrow(X_train)+1):nrow(X_all2), ]

# Create model matrices (one-hot encoding)
mm_train <- model.matrix(~ . - 1, data = X_train2)
mm_test  <- model.matrix(~ . - 1, data = X_test2)

# Perform cross-validation to select optimal lambda
set.seed(42)
cv_ridge <- cv.glmnet(
  x = mm_train,
  y = log1p(y_train),
  alpha = 0,         # Ridge
  nfolds = 5
)

best_lambda <- cv_ridge$lambda.min
best_lambda

## [1] 0.1508895

plot(cv_ridge)

# Train final Ridge regression model
ridge_fit <- glmnet(
  x = mm_train,
  y = log1p(y_train),
  alpha = 0,
  lambda = best_lambda
)
# Generate predictions and transform back to original scale
pred_log <- predict(ridge_fit, newx = mm_test)
pred <- expm1(as.numeric(pred_log))
# Evaluate Ridge regression model
cat("Ridge (log-target):\n")

## Ridge (log-target):

cat("RMSE:", rmse(y_test, pred), "\n")

## RMSE: 18389.07

cat("MAE :", mae(y_test, pred), "\n")

## MAE : 9914.029

cat("R2  :", r2(y_test, pred), "\n")

## R2  : 0.8409656

Model Interpretation: Key Predictors

To enhance interpretability, the coefficients of the Ridge regression model are examined. By ranking predictors based on the absolute magnitude of their coefficients, the most influential factors affecting annual income can be identified.

The results show that job title (Title Description) is the dominant predictor, reflecting the structured nature of compensation within the New York City government. Other important predictors include base salary, agency affiliation, and pay basis. The direction and magnitude of the coefficients are consistent with real-world salary structures, supporting the validity of the model.

# Extract model coefficients
coef_vec <- as.matrix(coef(ridge_fit))
coef_df <- data.frame(
  feature = rownames(coef_vec),
  coef = as.numeric(coef_vec)
)
# Remove intercept and compute absolute values
coef_df <- coef_df[coef_df$feature != "(Intercept)", ]
coef_df$abs_coef <- abs(coef_df$coef)
# Display top 20 most influential predictors
top20 <- coef_df[order(-coef_df$abs_coef), ][1:20, ]
top20

##                                                        feature      coef
## 285                   Title_Descriptionnon-teaching adjunct ii -3.413249
## 348                   Title_Descriptionsewage treatment worker -3.336435
## 369 Title_Descriptionsuperintendent of water and sewer systems  2.697947
## 148                Title_Descriptioncall center representative -2.687453
## 364                    Title_Descriptionsubstitute school aide -2.631358
## 166                              Title_Descriptioncity planner  2.441522
## 338             Title_Descriptionsenior occupational therapist  2.413147
## 382             Title_Descriptionsupervisor of school security  2.275472
## 215                               Title_Descriptionelectrician  2.275146
## 344           Title_Descriptionsergeant d/a special assignment  2.247576
## 350                             Title_Descriptionsocial worker -2.223493
## 77                   Title_Descriptionadjunct college lab tech -2.170581
## 24                      Agency_Namedepartment of city planning -2.144856
## 170                         Title_Descriptioncity service aide -2.140913
## 79                          Title_Descriptionadjunct professor -2.118765
## 86                  Title_Descriptionadministrative accountant  2.078307
## 255                         Title_Descriptionhousing caretaker  2.050431
## 359                       Title_Descriptionstationary engineer  2.044397
## 286                  Title_Descriptionnon-teaching adjunct iii -2.013277
## 92           Title_Descriptionadministrative education officer  2.003008
##     abs_coef
## 285 3.413249
## 348 3.336435
## 369 2.697947
## 148 2.687453
## 364 2.631358
## 166 2.441522
## 338 2.413147
## 382 2.275472
## 215 2.275146
## 344 2.247576
## 350 2.223493
## 77  2.170581
## 24  2.144856
## 170 2.140913
## 79  2.118765
## 86  2.078307
## 255 2.050431
## 359 2.044397
## 286 2.013277
## 92  2.003008

Error Analysis

An error analysis is conducted to investigate cases with the largest prediction errors. The analysis reveals that higher errors tend to occur in specific job roles or agencies where income structures are more complex or less consistent.

These discrepancies may result from irregular payments, partial-year employment, or special allowances that are not fully captured by the available predictors. Despite these limitations, the model demonstrates strong overall performance, and such errors are expected in real-world payroll data.

# Create error analysis dataset
res <- data.frame(
  actual = y_test,
  pred = pred
)
# Compute prediction errors
res$error <- res$pred - res$actual
res$abs_error <- abs(res$error)
# Inspect cases with the largest absolute errors
head(res[order(-res$abs_error), ], 10)

##         actual       pred      error abs_error
## 1171  85721.48 330747.574  245026.09 245026.09
## 350  167583.17  36593.827 -130989.34 130989.34
## 945   92418.76 198446.096  106027.34 106027.34
## 817  136280.46  32067.900 -104212.56 104212.56
## 363  104492.25   1686.626 -102805.62 102805.62
## 202  122551.33  21411.060 -101140.27 101140.27
## 1273 109184.78  14160.450  -95024.33  95024.33
## 1193 196161.99 289882.386   93720.40  93720.40
## 349  132518.37  39022.946  -93495.42  93495.42
## 1202  82484.88 173651.568   91166.69  91166.69

summary(res$abs_error)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 6.030e-01 1.244e+03 4.956e+03 9.914e+03 1.292e+04 2.450e+05

# Combine errors with original test features for interpretation
res2 <- cbind(res, X_test2)
head(res2[order(-res2$abs_error), c("actual","pred","abs_error","Title_Description","Agency_Name")], 10)

##         actual       pred abs_error          Title_Description
## 1171  85721.48 330747.574 245026.09 certified it administrator
## 350  167583.17  36593.827 130989.34         custodian engineer
## 945   92418.76 198446.096 106027.34            agency attorney
## 817  136280.46  32067.900 104212.56         custodian engineer
## 363  104492.25   1686.626 102805.62                firefighter
## 202  122551.33  21411.060 101140.27                  principal
## 1273 109184.78  14160.450  95024.33             police officer
## 1193 196161.99 289882.386  93720.40                    captain
## 349  132518.37  39022.946  93495.42         custodian engineer
## 1202  82484.88 173651.568  91166.69            agency attorney
##                         Agency_Name
## 1171     dept of parks & recreation
## 350            doe custodial payrol
## 945   department of education admin
## 817            doe custodial payrol
## 363                 fire department
## 202          dept of ed pedagogical
## 1273              police department
## 1193                fire department
## 349            doe custodial payrol
## 1202 housing preservation & dvlpmnt

Discussion

The regression results demonstrate that the Ridge regression model provides a strong and reliable prediction of employee annual income. Compared to the baseline model, the Ridge model achieves a substantially higher R² value and significantly lower prediction errors, indicating that the selected predictors capture most of the meaningful variation in employee compensation.

The dominance of job title variables among the most influential predictors highlights the structured nature of salary determination within the New York City government. Since compensation is largely determined by job classification, role responsibility, and union or pay grade systems, it is reasonable that job title plays a central role in predicting income. Base salary, agency affiliation, and pay basis further refine these predictions by capturing institutional and contractual differences across departments.

The use of a log-transformed target variable improves model stability by reducing the influence of extreme income values. This transformation allows the model to better capture relative income differences across employees, particularly in a dataset with a right-skewed income distribution.

Error analysis reveals that the largest prediction errors tend to occur in specific job roles or agencies with complex or irregular compensation structures. In such cases, actual income may be influenced by partial-year employment, special allowances, or atypical payment arrangements that are not fully represented by the available features. These findings suggest that while the model performs well overall, incorporating additional variables related to employment duration or special compensation types could further improve accuracy.

Overall, the results indicate that the Ridge regression model is well-suited for modeling employee income in this context and provides meaningful insights into the factors driving salary variation.

Conclusion

The regression analysis successfully models the annual income of New York City employees using payroll and job-related data. By comparing a baseline approach with a Ridge regression model, the analysis demonstrates that regularized regression significantly improves predictive performance and explanatory power.

The classification analysis also creates a reliable and robust model, with high accuracies on predicting employee leave status, which helps in risk management for human resources.

With the success of the analyses, we hope that the project will set a precedence for future work to look for feasibilities in extending the analysis onto local markets to evaluate the generalizability. ***

Report: WQD7004 Project - Group 11

Chong Zhan Hang, Shi Yaqi, Long Yun, Liang Sheng Bao, Huang Xin Yi

2026-01-13

WQD7004 Group Project - Group 11

Introduction

Project Objectives

Data Cleaning

1. Data Loading and Fiscal Year Filtering

2. Unique Employee ID Creation and Sampling

3. Initial Data Overview

4. Type Conversion and Standardization

5. Missing Value Handling

6. Outlier Treatment

7. Duplicate Removal and Factor Conversion

8. Saving Cleaned Data

9. Conclusion and Summary

Exploratory Data Analysis and Data Visualization

1. Executive Summary

2. Data Loading and Initial Setup

2.1 Loading Required Libraries

2.2 Loading the Data

2.3 Data Overview Statistics

3. Base Salary Distribution Analysis

3.1 Histogram of Base Salaries

Code Implementation:

Visualization:

3.2 Boxplot by Work Location

Code Implementation:

Visualization:

4. Agency Analysis

4.1 Top 10 Agencies by Average Salary

Code Implementation:

Visualization:

4.2 Agency Size vs Salary Relationship

Code Implementation:

Visualization:

5. Salary Composition by Leave Status

Code Implementation:

Visualization:

6. Time Trend Analysis

Code Implementation:

Visualization:

7. Comprehensive Dashboard

Code Implementation:

Visualization:

8. Key Findings and Recommendations

8.1 Major Findings

8.2 Recommendations

9. Appendix

9.1 Complete Summary Statistics

9.2 R Session Information

9.3 Data Quality Notes

How to Use This Report

For Data Analysts:

For Decision Makers:

For Technical Reviewers:

Analysis: Classification

Loading Data

Additional Data Processing for Classification Analysis

Train-test Split

Model Training and Evaluation

Analysis: Regression

Objective of Regression Analysis

Target Variable Construction

Feature Selection

Train–Test Split and Baseline Model

Ridge Regression Model

Model Interpretation: Key Predictors

Error Analysis

Discussion

Conclusion