WQD7004 Group Project - Group 11

Team Members

Chong Zhan Hang (24077643)

Shi Yaqi (24227144)

Long Yun (24213952)

Liang Shengbao (24200736)

Huang Xinyi (25058110)

Introduction

This project applies data science techniques to a large-scale, real-world administrative dataset to demonstrate data cleaning, exploratory analysis, and predictive modelling using R.

The NYC Citywide Payroll dataset was selected due to its size, complexity, and inherent data quality issues, which closely reflect practical challenges encountered in organisational and public-sector analytics.

The dataset contains detailed employee, payroll, and job-related information across multiple city agencies, making it well suited for both regression and classification problems.

Through this project, the group aims to transform raw, noisy payroll data into a structured analytical dataset and extract meaningful insights related to employee compensation and employment status.

Project Objectives

The objectives of this project are as follows:

To explore and understand the structure, content, and limitations of the NYC Citywide Payroll dataset through systematic data inspection.
To perform comprehensive data cleaning and preprocessing, including handling missing values, standardising categorical variables, and resolving inconsistencies in administrative records.
To present findings through clear visualisation, model evaluation metrics, and concise interpretation suitable for decision-making contexts.
To develop a regression model to analyse factors influencing employee compensation using payroll and job-related attributes.
To build and evaluate classification models to predict employee leave status based on employment and compensation characteristics.
To compare the performance of different modelling approaches.

Data Cleaning

1. Data Loading and Fiscal Year Filtering

This section covers loading the raw CSV data and filtering it to include only records from the latest three consecutive fiscal years. Load the original CSV dataset and filter for records from the latest three consecutive fiscal years.

R Code

library(readr)
library(dplyr)

# Load the dataset
df <- read_csv('Citywide_Payroll_Data__Fiscal_Year_.csv', show_col_types = FALSE)

# Identify and filter for the latest three fiscal years
unique_fiscal_years <- unique(df$`Fiscal Year`)
sorted_fiscal_years <- sort(unique_fiscal_years, decreasing = TRUE)
latest_three_fiscal_years <- head(sorted_fiscal_years, 3)
df_filtered <- df %>% filter(`Fiscal Year` %in% latest_three_fiscal_years)

cat(paste("Original data loaded and filtered for latest three fiscal years:", toString(latest_three_fiscal_years), "\n"))

## Original data loaded and filtered for latest three fiscal years: 2017, 2016, 2015

head(df_filtered)

## # A tibble: 6 × 16
##   `Fiscal Year` `Agency Name`               `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                       <chr>       <chr>        <chr>     
## 1          2016 DISTRICT ATTORNEY-MANHATTAN ABA'AHMID   RAHASHEEM    E         
## 2          2016 DISTRICT ATTORNEY-MANHATTAN ABENSUR     MARGARET     <NA>      
## 3          2016 DISTRICT ATTORNEY-MANHATTAN ABOUNAOUM   ANDREA       L         
## 4          2016 DISTRICT ATTORNEY-MANHATTAN ABRAHAM     JONATHAN     J         
## 5          2016 DISTRICT ATTORNEY-MANHATTAN ABRAMS      JOSEPH       <NA>      
## 6          2016 DISTRICT ATTORNEY-MANHATTAN ABREU       JENNIFER     <NA>      
## # ℹ 11 more variables: `Agency Start Date` <chr>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## #   `Total OT Paid` <chr>, `Total Other Pay` <chr>

Explanation First, the necessary libraries readr and dplyr are loaded. The Citywide_Payroll_Data__Fiscal_Year_.csv file is then loaded into a dataframe named df. To identify the latest three fiscal years, unique fiscal years are extracted, sorted in descending order, and the top three are selected. Finally, df_filtered is created by filtering the original df to include only records corresponding to these three latest fiscal years. This ensures that our analysis focuses on the most recent data.

2. Unique Employee ID Creation and Sampling

This section details the creation of unique employee identifiers and the subsequent structured sampling of the dataset. Create an approximately unique employee_id by combining ‘Last Name’, ‘First Name’, and ‘Agency Start Date’. Then, randomly sample 2500 unique employee_ids from the filtered data.

R Code

df_filtered <- df_filtered %>%
  mutate(
    `Last Name` = ifelse(is.na(`Last Name`), "", as.character(`Last Name`)),
    `First Name` = ifelse(is.na(`First Name`), "", as.character(`First Name`)),
    `Agency Start Date` = ifelse(is.na(`Agency Start Date`), "", as.character(`Agency Start Date`)),
    employee_id = paste0(`Last Name`, `First Name`, `Agency Start Date`)
  )

set.seed(12345)

unique_employee_ids <- unique(df_filtered$employee_id)
sampled_employee_ids <- sample(unique_employee_ids, min(2500, length(unique_employee_ids)))
df_sampled <- df_filtered %>% filter(employee_id %in% sampled_employee_ids)

cat(paste("Number of unique employee IDs created:", length(unique_employee_ids), "\n"))

## Number of unique employee IDs created: 642096

cat(paste("Number of employee IDs sampled:", length(sampled_employee_ids), "\n"))

## Number of employee IDs sampled: 2500

cat(paste("Number of records after sampling:", nrow(df_sampled), "\n"))

## Number of records after sampling: 6518

head(df_sampled)

## # A tibble: 6 × 17
##   `Fiscal Year` `Agency Name`                `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                        <chr>       <chr>        <chr>     
## 1          2016 DISTRICT ATTORNEY-MANHATTAN  "CONCANNON" "CONNOR"     M         
## 2          2016 DISTRICT ATTORNEY-MANHATTAN  "KIM"       "SU KYOUNG"  <NA>      
## 3          2016 POLICE DEPARTMENT            "ADME"      "NICOT"      <NA>      
## 4          2016 DISTRICT ATTORNEY-MANHATTAN  "WALEN"     "WILLIAMS"   W         
## 5          2016 DISTRICT ATTORNEY-MANHATTAN  "YULDASHEV" "ANVAR"      <NA>      
## 6          2016 DISTRICT ATTORNEY KINGS COU… ""          ""           <NA>      
## # ℹ 12 more variables: `Agency Start Date` <chr>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <chr>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <chr>, `OT Hours` <dbl>,
## #   `Total OT Paid` <chr>, `Total Other Pay` <chr>, employee_id <chr>

Explanation An employee_id is created by concatenating Last Name, First Name, and Agency Start Date to provide a pseudo-unique identifier for each employee. Missing values in these columns are replaced with empty strings before concatenation to avoid NA values in employee_id. From the df_filtered dataset, all unique employee_ids are extracted, and a random sample of 2500 (or fewer if less than 2500 unique IDs exist) is selected. Finally, df_sampled is created by filtering df_filtered to include only records belonging to these sampled employee_ids, ensuring a representative subset for further cleaning and analysis.

3. Initial Data Overview

This section provides a preliminary overview of the sampled dataset. Perform an initial overview of df_sampled using R functions like str(), dim(), and summary() to check row count, column types, and basic statistics.

R Code

str(df_sampled)

## tibble [6,518 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6518] "DISTRICT ATTORNEY-MANHATTAN" "DISTRICT ATTORNEY-MANHATTAN" "POLICE DEPARTMENT" "DISTRICT ATTORNEY-MANHATTAN" ...
##  $ Last Name                 : chr [1:6518] "CONCANNON" "KIM" "ADME" "WALEN" ...
##  $ First Name                : chr [1:6518] "CONNOR" "SU KYOUNG" "NICOT" "WILLIAMS" ...
##  $ Mid Init                  : chr [1:6518] "M" NA NA "W" ...
##  $ Agency Start Date         : chr [1:6518] "01/04/2012" "07/23/2012" "01/10/2005" "09/09/2013" ...
##  $ Work Location Borough     : chr [1:6518] "MANHATTAN" "MANHATTAN" "MANHATTAN" "MANHATTAN" ...
##  $ Title Description         : chr [1:6518] "COMMUNITY COORDINATOR" "COMMUNITY COORDINATOR" "SERGEANT" "COMMUNITY ASSOCIATE" ...
##  $ Leave Status as of June 30: chr [1:6518] "ACTIVE" "ACTIVE" "ACTIVE" "ACTIVE" ...
##  $ Base Salary               : chr [1:6518] "$82000.00" "$84694.00" "$103585.00" "$41402.00" ...
##  $ Pay Basis                 : chr [1:6518] "per Annum" "per Annum" "per Annum" "per Annum" ...
##  $ Regular Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ Regular Gross Paid        : chr [1:6518] "$80311.48" "$83784.44" "$90639.84" "$40957.32" ...
##  $ OT Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ Total OT Paid             : chr [1:6518] "$5738.95" "$0.00" "$22381.82" "$666.93" ...
##  $ Total Other Pay           : chr [1:6518] "$502.90" "$0.00" "$13749.69" "$10.77" ...
##  $ employee_id               : chr [1:6518] "CONCANNONCONNOR01/04/2012" "KIMSU KYOUNG07/23/2012" "ADMENICOT01/10/2005" "WALENWILLIAMS09/09/2013" ...

dim(df_sampled)

## [1] 6518   17

summary(df_sampled)

##   Fiscal Year   Agency Name         Last Name          First Name       
##  Min.   :2015   Length:6518        Length:6518        Length:6518       
##  1st Qu.:2015   Class :character   Class :character   Class :character  
##  Median :2016   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2017                                                           
##  Max.   :2017                                                           
##    Mid Init         Agency Start Date  Work Location Borough Title Description 
##  Length:6518        Length:6518        Length:6518           Length:6518       
##  Class :character   Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character   Mode  :character      Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##  Leave Status as of June 30 Base Salary         Pay Basis        
##  Length:6518                Length:6518        Length:6518       
##  Class :character           Class :character   Class :character  
##  Mode  :character           Mode  :character   Mode  :character  
##                                                                  
##                                                                  
##                                                                  
##  Regular Hours     Regular Gross Paid    OT Hours       Total OT Paid     
##  Min.   : -49.15   Length:6518        Min.   :   0.00   Length:6518       
##  1st Qu.:   0.00   Class :character   1st Qu.:   0.00   Class :character  
##  Median :   0.00   Mode  :character   Median :   0.00   Mode  :character  
##  Mean   : 641.11                      Mean   :  58.04                     
##  3rd Qu.:1825.00                      3rd Qu.:   0.00                     
##  Max.   :2232.87                      Max.   :1280.00                     
##  Total Other Pay    employee_id       
##  Length:6518        Length:6518       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Explanation To understand the structure and basic characteristics of the df_sampled dataframe, str() is used to inspect the data types of each column, dim() provides the dimensions (number of rows and columns), and summary() generates descriptive statistics for each column, including min, max, mean, median, and quartile values for numerical columns, and frequency counts for categorical ones. This step is crucial for identifying potential issues like incorrect data types or initial data distributions.

4. Type Conversion and Standardization

This section details the conversion of data types and standardization of text fields. Convert currency/pay columns to numeric, date columns to date type, handle negative values in salary/pay columns by setting them to 0, and standardize categorical text columns.

R Code

df_cleaned <- df_sampled %>%
  mutate(
    # Convert currency columns to numeric
    `Base Salary` = as.numeric(gsub("\\$|,", "", `Base Salary`)),
    `Regular Gross Paid` = as.numeric(gsub("\\$|,|", "", `Regular Gross Paid`)),
    `Total OT Paid` = as.numeric(gsub("\\$|,|", "", `Total OT Paid`)),
    `Total Other Pay` = as.numeric(gsub("\\$|,|", "", `Total Other Pay`)),

    # Convert Agency Start Date to date type
    `Agency Start Date` = as.Date(`Agency Start Date`, format = "%m/%d/%Y")
  ) %>%
  mutate(
    # Replace negative values with 0 in salary/pay columns and Regular Hours
    `Base Salary` = ifelse(`Base Salary` < 0, 0, `Base Salary`),
    `Regular Gross Paid` = ifelse(`Regular Gross Paid` < 0, 0, `Regular Gross Paid`),
    `Total OT Paid` = ifelse(`Total OT Paid` < 0, 0, `Total OT Paid`),
    `Total Other Pay` = ifelse(`Total Other Pay` < 0, 0, `Total Other Pay`),
    `Regular Hours` = ifelse(`Regular Hours` < 0, 0, `Regular Hours`)
  ) %>%
  mutate(
    # Standardize categorical text columns
    `Agency Name` = tolower(trimws(`Agency Name`)),
    `Last Name` = tolower(trimws(`Last Name`)),
    `First Name` = tolower(trimws(`First Name`)),
    `Mid Init` = tolower(trimws(`Mid Init`)),
    `Work Location Borough` = tolower(trimws(`Work Location Borough`)),
    `Title Description` = tolower(trimws(`Title Description`)),
    `Pay Basis` = tolower(trimws(`Pay Basis`)),
    `Leave Status as of June 30` = tolower(trimws(`Leave Status as of June 30`))
  )
cat("Data types converted and text columns standardized.\n")

## Data types converted and text columns standardized.

str(df_cleaned)

## tibble [6,518 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6518] "district attorney-manhattan" "district attorney-manhattan" "police department" "district attorney-manhattan" ...
##  $ Last Name                 : chr [1:6518] "concannon" "kim" "adme" "walen" ...
##  $ First Name                : chr [1:6518] "connor" "su kyoung" "nicot" "williams" ...
##  $ Mid Init                  : chr [1:6518] "m" NA NA "w" ...
##  $ Agency Start Date         : Date[1:6518], format: "2012-01-04" "2012-07-23" ...
##  $ Work Location Borough     : chr [1:6518] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title Description         : chr [1:6518] "community coordinator" "community coordinator" "sergeant" "community associate" ...
##  $ Leave Status as of June 30: chr [1:6518] "active" "active" "active" "active" ...
##  $ Base Salary               : num [1:6518] 82000 84694 103585 41402 269 ...
##  $ Pay Basis                 : chr [1:6518] "per annum" "per annum" "per annum" "per annum" ...
##  $ Regular Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ Regular Gross Paid        : num [1:6518] 80311 83784 90640 40957 74665 ...
##  $ OT Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ Total OT Paid             : num [1:6518] 5739 0 22382 667 14904 ...
##  $ Total Other Pay           : num [1:6518] 502.9 0 13749.7 10.8 9107 ...
##  $ employee_id               : chr [1:6518] "CONCANNONCONNOR01/04/2012" "KIMSU KYOUNG07/23/2012" "ADMENICOT01/10/2005" "WALENWILLIAMS09/09/2013" ...

summary(df_cleaned)

##   Fiscal Year   Agency Name         Last Name          First Name       
##  Min.   :2015   Length:6518        Length:6518        Length:6518       
##  1st Qu.:2015   Class :character   Class :character   Class :character  
##  Median :2016   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2016                                                           
##  3rd Qu.:2017                                                           
##  Max.   :2017                                                           
##    Mid Init         Agency Start Date    Work Location Borough
##  Length:6518        Min.   :1965-09-10   Length:6518          
##  Class :character   1st Qu.:1998-08-31   Class :character     
##  Mode  :character   Median :2006-11-20   Mode  :character     
##                     Mean   :2004-11-08                        
##                     3rd Qu.:2013-04-25                        
##                     Max.   :2017-06-19                        
##  Title Description  Leave Status as of June 30  Base Salary       
##  Length:6518        Length:6518                Min.   :     1.00  
##  Class :character   Class :character           1st Qu.:    33.18  
##  Mode  :character   Mode  :character           Median : 39370.00  
##                                                Mean   : 41243.25  
##                                                3rd Qu.: 76286.00  
##                                                Max.   :207432.00  
##   Pay Basis         Regular Hours    Regular Gross Paid    OT Hours      
##  Length:6518        Min.   :   0.0   Min.   :     0     Min.   :   0.00  
##  Class :character   1st Qu.:   0.0   1st Qu.:  3551     1st Qu.:   0.00  
##  Mode  :character   Median :   0.0   Median : 34108     Median :   0.00  
##                     Mean   : 641.1   Mean   : 39871     Mean   :  58.04  
##                     3rd Qu.:1825.0   3rd Qu.: 69742     3rd Qu.:   0.00  
##                     Max.   :2232.9   Max.   :204496     Max.   :1280.00  
##  Total OT Paid      Total Other Pay employee_id       
##  Min.   :     0.0   Min.   :    0   Length:6518       
##  1st Qu.:     0.0   1st Qu.:    0   Class :character  
##  Median :     0.0   Median :    0   Mode  :character  
##  Mean   :  3203.0   Mean   : 2066                     
##  3rd Qu.:   128.9   3rd Qu.: 1080                     
##  Max.   :143616.0   Max.   :33618

Explanation This section converts critical columns to their appropriate data types and standardizes text. Currency symbols ($) and commas (,) are removed from financial columns (Base Salary, Regular Gross Paid, Total OT Paid, Total Other Pay), and these columns are converted to numeric type. The Agency Start Date column is converted from character to date format. Negative values in Base Salary, Regular Gross Paid, Total OT Paid, Total Other Pay, and Regular Hours are logically set to 0. Finally, various categorical text columns (e.g., Agency Name, Last Name, First Name) are converted to lowercase and leading/trailing whitespace is removed to ensure consistency and facilitate future analysis.

5. Missing Value Handling

This section addresses missing values in specific columns of the dataset. Fill missing values in ‘Work Location Borough’ using the mode, and fill missing values (including empty strings) in ‘Mid Init’, ‘Last Name’, and ‘First Name’ with ‘unknown’.

R Code

get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode_work_location_borough <- get_mode(df_cleaned$`Work Location Borough`)
df_cleaned <- df_cleaned %>%
  mutate(`Work Location Borough` = ifelse(is.na(`Work Location Borough`), mode_work_location_borough, `Work Location Borough`))

df_cleaned <- df_cleaned %>%
  mutate(
    `Mid Init` = ifelse(is.na(`Mid Init`) | `Mid Init` == "", "unknown", `Mid Init`),
    `Last Name` = ifelse(is.na(`Last Name`) | `Last Name` == "", "unknown", `Last Name`),
    `First Name` = ifelse(is.na(`First Name`) | `First Name` == "", "unknown", `First Name`)
  )
cat("Missing values in specified columns have been handled.\n")

## Missing values in specified columns have been handled.

cat("NA count in Work Location Borough: ", sum(is.na(df_cleaned$`Work Location Borough`)), "\n")

## NA count in Work Location Borough:  0

cat("Empty string count in Work Location Borough: ", sum(df_cleaned$`Work Location Borough` == ""), "\n")

## Empty string count in Work Location Borough:  0

cat("NA count in Mid Init: ", sum(is.na(df_cleaned$`Mid Init`)), "\n")

## NA count in Mid Init:  0

cat("Empty string count in Mid Init: ", sum(df_cleaned$`Mid Init` == ""), "\n")

## Empty string count in Mid Init:  0

cat("NA count in Last Name: ", sum(is.na(df_cleaned$`Last Name`)), "\n")

## NA count in Last Name:  0

cat("Empty string count in Last Name: ", sum(df_cleaned$`Last Name` == ""), "\n")

## Empty string count in Last Name:  0

cat("NA count in First Name: ", sum(is.na(df_cleaned$`First Name`)), "\n")

## NA count in First Name:  0

cat("Empty string count in First Name: ", sum(df_cleaned$`First Name` == ""), "\n")

## Empty string count in First Name:  0

Explanation Missing values are handled to ensure data completeness. For Work Location Borough, a custom get_mode function is defined to calculate the mode (most frequent value), which is then used to impute any NA values in this column. For Mid Init, Last Name, and First Name, both NA values and empty strings are replaced with the placeholder "unknown". This comprehensive imputation ensures no critical categorical data points are left missing.

6. Outlier Treatment

This section describes the identification and treatment of outliers in numerical columns. Apply IQR capping to ‘Base Salary’ while explicitly skipping ‘OT Hours’, ‘Total OT Paid’, and ‘Total Other Pay’.

R Code

cap_outliers_iqr <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  upper_bound <- q3 + 1.5 * iqr
  lower_bound <- q1 - 1.5 * iqr
  x[x > upper_bound] <- upper_bound
  x[x < lower_bound] <- lower_bound
  return(x)
}

cat("Summary of Base Salary BEFORE IQR Capping:\n")

## Summary of Base Salary BEFORE IQR Capping:

print(summary(df_cleaned$`Base Salary`))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00     33.18  39370.00  41243.25  76286.00 207432.00

df_cleaned$`Base Salary` <- cap_outliers_iqr(df_cleaned$`Base Salary`)
cat("\nSummary of Base Salary AFTER IQR Capping:\n")

## 
## Summary of Base Salary AFTER IQR Capping:

print(summary(df_cleaned$`Base Salary`))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00     33.18  39370.00  41239.06  76286.00 190665.23

cat("IQR capping applied to 'Base Salary'. Other specified columns were skipped.\n")

## IQR capping applied to 'Base Salary'. Other specified columns were skipped.

Explanation Outliers in Base Salary were capped using the Interquartile Range (IQR) method. Values falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR were replaced with the respective bounds. This approach mitigates the impact of extreme values on statistical analyses. Other potentially skewed financial columns, such as OT Hours, Total OT Paid, and Total Other Pay, were intentionally excluded from capping. This decision was made because high values in these categories might represent legitimate compensation (e.g., extensive overtime or bonuses) and blanket capping could distort their true variability, which is best handled during the analysis phase (e.g., via log transformations) to preserve information.

7. Duplicate Removal and Factor Conversion

This section focuses on ensuring data uniqueness and converting categorical variables to factor types. Remove all duplicate rows from df_cleaned and convert ‘Agency Name’, ‘Work Location Borough’, ‘Leave Status as of June 30’, ‘Pay Basis’, and ‘Title Description’ to factor types.

R Code

cat(paste("Number of records BEFORE removing duplicates:", nrow(df_cleaned), "\n"))

## Number of records BEFORE removing duplicates: 6518

df_cleaned <- df_cleaned %>%
    distinct()
cat(paste("Number of records AFTER removing duplicates:", nrow(df_cleaned), "\n"))

## Number of records AFTER removing duplicates: 6518

df_cleaned <- df_cleaned %>%
  mutate(
    `Agency Name` = as.factor(`Agency Name`),
    `Work Location Borough` = as.factor(`Work Location Borough`),
    `Leave Status as of June 30` = as.factor(`Leave Status as of June 30`),
    `Pay Basis` = as.factor(`Pay Basis`),
    `Title Description` = as.factor(`Title Description`)
  )
cat("Duplicate rows removed and specified columns converted to factor types.\n")

## Duplicate rows removed and specified columns converted to factor types.

str(df_cleaned)

## tibble [6,518 × 17] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : Factor w/ 66 levels "admin for children's svcs",..: 41 41 64 41 41 43 43 43 43 44 ...
##  $ Last Name                 : chr [1:6518] "concannon" "kim" "adme" "walen" ...
##  $ First Name                : chr [1:6518] "connor" "su kyoung" "nicot" "williams" ...
##  $ Mid Init                  : chr [1:6518] "m" "unknown" "unknown" "w" ...
##  $ Agency Start Date         : Date[1:6518], format: "2012-01-04" "2012-07-23" ...
##  $ Work Location Borough     : Factor w/ 10 levels "bronx","brooklyn",..: 5 5 5 5 5 2 2 2 2 7 ...
##  $ Title Description         : Factor w/ 342 levels "*housing caretaker",..: 106 106 274 105 179 248 42 105 42 105 ...
##  $ Leave Status as of June 30: Factor w/ 5 levels "active","ceased",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Base Salary               : num [1:6518] 82000 84694 103585 41402 269 ...
##  $ Pay Basis                 : Factor w/ 4 levels "per annum","per day",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ Regular Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ Regular Gross Paid        : num [1:6518] 80311 83784 90640 40957 74665 ...
##  $ OT Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ Total OT Paid             : num [1:6518] 5739 0 22382 667 14904 ...
##  $ Total Other Pay           : num [1:6518] 502.9 0 13749.7 10.8 9107 ...
##  $ employee_id               : chr [1:6518] "CONCANNONCONNOR01/04/2012" "KIMSU KYOUNG07/23/2012" "ADMENICOT01/10/2005" "WALENWILLIAMS09/09/2013" ...

Explanation To maintain data integrity and prevent biased analyses, all completely duplicate rows were removed from df_cleaned. Following this, key categorical variables such as Agency Name, Work Location Borough, Leave Status as of June 30, Pay Basis, and Title Description were explicitly converted to R’s factor type. This conversion optimizes memory usage for categorical data and prepares these variables for statistical modeling and visualization functions that often require factor inputs.

8. Saving Cleaned Data

This final section details the saving of the processed dataset for future use. Save the cleaned dataset to new CSV and RData files.

R Code

write_csv(df_cleaned, "cleaned_payroll_data.csv")
save(df_cleaned, file = "cleaned_payroll_data.RData")
cat("Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.\n")

## Cleaned data saved as 'cleaned_payroll_data.csv' and 'cleaned_payroll_data.RData'.

Explanation The thoroughly cleaned and preprocessed df_cleaned dataset is saved in two formats: as a CSV file (cleaned_payroll_data.csv) and as an RData file (cleaned_payroll_data.RData). Saving in both formats ensures accessibility and ease of use for different R workflows and for sharing with other team members or tools. This step marks the completion of the data cleaning phase, providing a refined dataset ready for subsequent visualization, analysis, and model building.

9. Conclusion and Summary

The entire data cleaning process for the New York City - Citywide Payroll Data has been completed, transforming the raw dataset into a high-quality, analysis-ready format. The key steps undertaken include:

Structured Sampling: The dataset was filtered to include records from the latest three fiscal years (2015-2017). A sample of 2500 unique employees was drawn, resulting in a working dataset of 6538 records, ensuring longitudinal data integrity for attrition prediction.
Data Type Conversion and Standardization: Monetary and date columns were correctly converted. Negative values in all salary, pay, and regular hours columns were logically set to zero. Categorical text fields were standardized to lowercase and trimmed for consistency.
Missing Value Handling: Missing Work Location Borough values were imputed using the mode. Missing or empty strings in Mid Init, Last Name, and First Name were filled with "unknown", crucial for maintaining employee_id integrity.
Outlier Treatment: Outliers in Base Salary were capped using the IQR method to mitigate extreme values’ impact. OT Hours, Total OT Paid, and Total Other Pay were intentionally skipped to preserve their inherent variability.
Duplicate Removal and Factor Conversion: Duplicate rows were checked and removed (none found after sampling). Key categorical variables were converted to R factor types, optimizing them for statistical modeling and analysis.

Outcome: The df_cleaned dataset is now robust, consistent, and free from common data quality issues. It has been saved as cleaned_payroll_data.csv and cleaned_payroll_data.RData and is fully prepared for subsequent data visualization, in-depth analysis, and model building by the respective team members.

Exploratory Data Analysis and Data Visualization

1. Executive Summary

This report provides a comprehensive visualization analysis of New York City payroll data over the last three fiscal years (2015-2017). The report includes both the visualizations and the R code used to generate them.

2. Data Loading and Initial Setup

2.1 Loading Required Libraries

# Load all required libraries
library(tidyverse)    # For data manipulation and visualization
library(ggplot2)      # For creating graphics
library(scales)       # For formatting plot scales
library(RColorBrewer) # For color palettes
library(patchwork)    # For combining multiple plots
library(knitr)        # For creating tables

2.2 Loading the Data

# Read the cleaned payroll data
df_cleaned <- read_csv("cleaned_payroll_data.csv")

# Display data structure
cat("Data Structure:\n")

## Data Structure:

str(df_cleaned, give.attr = FALSE)

## spc_tbl_ [6,518 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Fiscal Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency Name               : chr [1:6518] "district attorney-manhattan" "district attorney-manhattan" "police department" "district attorney-manhattan" ...
##  $ Last Name                 : chr [1:6518] "concannon" "kim" "adme" "walen" ...
##  $ First Name                : chr [1:6518] "connor" "su kyoung" "nicot" "williams" ...
##  $ Mid Init                  : chr [1:6518] "m" "unknown" "unknown" "w" ...
##  $ Agency Start Date         : Date[1:6518], format: "2012-01-04" "2012-07-23" ...
##  $ Work Location Borough     : chr [1:6518] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title Description         : chr [1:6518] "community coordinator" "community coordinator" "sergeant" "community associate" ...
##  $ Leave Status as of June 30: chr [1:6518] "active" "active" "active" "active" ...
##  $ Base Salary               : num [1:6518] 82000 84694 103585 41402 269 ...
##  $ Pay Basis                 : chr [1:6518] "per annum" "per annum" "per annum" "per annum" ...
##  $ Regular Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ Regular Gross Paid        : num [1:6518] 80311 83784 90640 40957 74665 ...
##  $ OT Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ Total OT Paid             : num [1:6518] 5739 0 22382 667 14904 ...
##  $ Total Other Pay           : num [1:6518] 502.9 0 13749.7 10.8 9107 ...
##  $ employee_id               : chr [1:6518] "CONCANNONCONNOR01/04/2012" "KIMSU KYOUNG07/23/2012" "ADMENICOT01/10/2005" "WALENWILLIAMS09/09/2013" ...

# Display first few rows
cat("\nFirst 6 rows of data:\n")

## 
## First 6 rows of data:

head(df_cleaned)

## # A tibble: 6 × 17
##   `Fiscal Year` `Agency Name`                `Last Name` `First Name` `Mid Init`
##           <dbl> <chr>                        <chr>       <chr>        <chr>     
## 1          2016 district attorney-manhattan  concannon   connor       m         
## 2          2016 district attorney-manhattan  kim         su kyoung    unknown   
## 3          2016 police department            adme        nicot        unknown   
## 4          2016 district attorney-manhattan  walen       williams     w         
## 5          2016 district attorney-manhattan  yuldashev   anvar        unknown   
## 6          2016 district attorney kings cou… unknown     unknown      unknown   
## # ℹ 12 more variables: `Agency Start Date` <date>,
## #   `Work Location Borough` <chr>, `Title Description` <chr>,
## #   `Leave Status as of June 30` <chr>, `Base Salary` <dbl>, `Pay Basis` <chr>,
## #   `Regular Hours` <dbl>, `Regular Gross Paid` <dbl>, `OT Hours` <dbl>,
## #   `Total OT Paid` <dbl>, `Total Other Pay` <dbl>, employee_id <chr>

2.3 Data Overview Statistics

# Calculate basic data statistics
total_records <- nrow(df_cleaned)
num_columns <- ncol(df_cleaned)
fiscal_year_range <- paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to ")
unique_employees <- length(unique(df_cleaned$employee_id))
num_agencies <- length(unique(df_cleaned$`Agency Name`))
num_locations <- length(unique(df_cleaned$`Work Location Borough`))

# Display statistics
cat("## Data Overview\n")

## ## Data Overview

cat("- **Total Records**:", format(total_records, big.mark = ","), "\n")

## - **Total Records**: 6,518

cat("- **Number of Columns**:", num_columns, "\n")

## - **Number of Columns**: 17

cat("- **Fiscal Year Range**:", fiscal_year_range, "\n")

## - **Fiscal Year Range**: 2015 to 2017

cat("- **Unique Employees (by ID)**:", format(unique_employees, big.mark = ","), "\n")

## - **Unique Employees (by ID)**: 2,500

cat("- **Number of Agencies**:", num_agencies, "\n")

## - **Number of Agencies**: 66

cat("- **Work Locations**:", num_locations, "\n")

## - **Work Locations**: 10

3. Base Salary Distribution Analysis

3.1 Histogram of Base Salaries

Code Implementation:

# Calculate summary statistics for interpretation
salary_summary <- df_cleaned %>%
  summarise(
    mean_salary = mean(`Base Salary`, na.rm = TRUE),
    median_salary = median(`Base Salary`, na.rm = TRUE),
    min_salary = min(`Base Salary`, na.rm = TRUE),
    max_salary = max(`Base Salary`, na.rm = TRUE),
    sd_salary = sd(`Base Salary`, na.rm = TRUE)
  )

# Print summary
cat("Base Salary Summary Statistics:\n")

## Base Salary Summary Statistics:

print(salary_summary)

## # A tibble: 1 × 5
##   mean_salary median_salary min_salary max_salary sd_salary
##         <dbl>         <dbl>      <dbl>      <dbl>     <dbl>
## 1      41239.         39370          1    190665.    39444.

# Create histogram with custom breaks and colors
salary_histogram <- ggplot(df_cleaned, aes(x = `Base Salary`)) + 
  geom_histogram(
    bins = 40,                    # Number of bins
    fill = "#3498db",            # Fill color
    alpha = 0.8,                 # Transparency
    color = "white",             # Border color
    boundary = 0                 # Start at 0
  ) + 
  labs(
    title = "Distribution of Base Salary - NYC Employees",
    subtitle = paste("Total Employees:", format(nrow(df_cleaned), big.mark = ",")),
    x = "Base Salary ($)", 
    y = "Number of Employees"
  ) +
  scale_x_continuous(
    labels = scales::dollar,      # Format as dollar amounts
    breaks = seq(0, 200000, 50000) # Custom breaks for better readability
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, color = "gray50", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

Visualization:

# Display the histogram
salary_histogram

Figure 1: Distribution of Base Salary

# Add vertical lines for mean and median
salary_histogram + 
  geom_vline(
    xintercept = salary_summary$mean_salary, 
    color = "red", 
    linetype = "dashed", 
    size = 1
  ) +
  geom_vline(
    xintercept = salary_summary$median_salary, 
    color = "blue", 
    linetype = "dashed", 
    size = 1
  ) +
  annotate(
    "text", 
    x = salary_summary$mean_salary, 
    y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.9,
    label = paste("Mean: $", format(round(salary_summary$mean_salary, 0), big.mark = ",")),
    color = "red",
    hjust = -0.1
  ) +
  annotate(
    "text", 
    x = salary_summary$median_salary, 
    y = max(ggplot_build(salary_histogram)$data[[1]]$count) * 0.8,
    label = paste("Median: $", format(round(salary_summary$median_salary, 0), big.mark = ",")),
    color = "blue",
    hjust = -0.1
  )

Figure 1: Distribution of Base Salary

Interpretation: - The histogram shows the distribution of base salaries for NYC employees - The majority of employees (approximately 70%) have base salaries concentrated in the $30,000-$80,000 range - The distribution is skewed, indicating a few high-earning outliers, and a lot of employees without pay, likely due to voluntary work - Red dashed line shows the mean salary, blue dashed line shows the median salary
Analysis: - This distribution characteristic is crucial for the regression model. Since the base salary variable does not conform to the normal distribution assumption, directly performing linear regression may lead to significant errors in the model’s predictions of high-salary samples. Therefore, it is recommended to perform a log-transformation on the “base salary” before modeling to make it closer to a normal distribution, thereby improving the model’s robustness.

3.2 Boxplot by Work Location

Code Implementation:

# Calculate borough statistics
borough_stats <- df_cleaned %>%
  group_by(`Work Location Borough`) %>%
  summarise(
    median_salary = median(`Base Salary`, na.rm = TRUE),
    mean_salary = mean(`Base Salary`, na.rm = TRUE),
    min_salary = min(`Base Salary`, na.rm = TRUE),
    max_salary = max(`Base Salary`, na.rm = TRUE),
    iqr_salary = IQR(`Base Salary`, na.rm = TRUE),
    count = n()
  ) %>%
  arrange(desc(median_salary))

# Print borough statistics
cat("Borough Salary Statistics (sorted by median salary):\n")

## Borough Salary Statistics (sorted by median salary):

print(borough_stats)

## # A tibble: 10 × 7
##    `Work Location Borough` median_salary mean_salary min_salary max_salary
##    <chr>                           <dbl>       <dbl>      <dbl>      <dbl>
##  1 other                         105142      105092.     107.      166148 
##  2 delaware                       72989       60042.   37085        77103 
##  3 ulster                         68683       68787.   66934        70743 
##  4 dutchess                       52000       52000    52000        52000 
##  5 richmond                       51774.      46826.       9.39    102054 
##  6 brooklyn                       45834       47232.       9       162973 
##  7 queens                         43042       45074.       9.3     154822 
##  8 bronx                          42954       41232.       8.75    171605 
##  9 manhattan                      30688.      36524.       1       190665.
## 10 westchester                      364.      19346.     364.       57310 
## # ℹ 2 more variables: iqr_salary <dbl>, count <int>

# Store the highest borough name for later use
highest_borough_name <- borough_stats$`Work Location Borough`[1]
highest_borough_median <- borough_stats$median_salary[1]

# Create color palette for boroughs
num_boroughs <- nrow(borough_stats)
color_palette <- brewer.pal(min(num_boroughs, 8), "Set3")
color_palette <- c(brewer.pal(8, "Set3"), "#999999")
# Create the boxplot
borough_boxplot <- ggplot(df_cleaned,
  aes(
    x = reorder(`Work Location Borough`, `Base Salary`, FUN = median),
    y = `Base Salary`,
    fill = `Work Location Borough`
  )
) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Base Salary Distribution by Work Location",
    x = "Work Location Borough",
    y = "Base Salary ($)"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(legend.position = "none")

Visualization:

# Display the boxplot
borough_boxplot

Figure 2: Base Salary Distribution by Borough

# Add horizontal line for overall median
overall_median <- median(df_cleaned$`Base Salary`, na.rm = TRUE)
borough_boxplot + 
  geom_hline(
    yintercept = overall_median, 
    color = "black", 
    linetype = "dashed", 
    size = 0.8,
    alpha = 0.5
  ) +
  annotate(
    "text", 
    x = 0.5, 
    y = overall_median,
    label = paste("Overall Median: $", format(round(overall_median, 0), big.mark = ",")),
    color = "black",
    hjust = 0,
    vjust = -0.5,
    size = 3.5
  )

Figure 2: Base Salary Distribution by Borough

Interpretation: - other has the highest median salary at $105,142 - Queens shows the widest salary range (largest IQR) - Black dashed line shows the overall median salary across all boroughs - Boxplots show median (middle line), IQR (box), and outliers (individual points)
Analysis: - The boxplot reveals significant geographic disparities in salary. Other exhibits the highest median wage Queens has the widest IQR, reflecting a disparities are wider in the area. In contrast, Brooklyn and Richmond show more uniform salary distributions, likely due to a prevalence of operational roles. - Borough is a feature with strong distinguishing ability. When constructing a regression model, the variable must be included as the core classification variable.

4. Agency Analysis

4.1 Top 10 Agencies by Average Salary

Code Implementation:

# Calculate average salary by agency
top_agencies <- df_cleaned %>%
  group_by(`Agency Name`) %>%
  summarise(
    avg_salary = mean(`Base Salary`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_salary)) %>%
  head(10)  # Get top 10 agencies

# Print top agencies data
cat("Top 10 Agencies by Average Salary:\n")

## Top 10 Agencies by Average Salary:

print(top_agencies)

## # A tibble: 10 × 4
##    `Agency Name`                  avg_salary employee_count total_payroll
##    <chr>                               <dbl>          <int>         <dbl>
##  1 financial info svcs agency        137334.              8       875006.
##  2 consumer affairs                   99791.              4       160634.
##  3 office of emergency management     90329               1          525.
##  4 office of management & budget      89453.              6       544071.
##  5 department of city planning        88250               4       330099.
##  6 department of finance              85046.             38      2783085.
##  7 dept of ed pedagogical             83639.           1321     94480320.
##  8 civilian complaint review bd       81250.              5       373538.
##  9 department of investigation        78879.              3       216509.
## 10 department of probation            78297               3       231401.

# Store highest agency info for later use
highest_agency_name <- top_agencies$`Agency Name`[1]
highest_agency_salary <- top_agencies$avg_salary[1]

# Calculate salary range for scaling
salary_range <- range(top_agencies$avg_salary)
cat("\nSalary Range for Top 10 Agencies: $", 
    format(round(salary_range[1], 0), big.mark = ","), 
    " to $", 
    format(round(salary_range[2], 0), big.mark = ","), 
    "\n")

## 
## Salary Range for Top 10 Agencies: $ 78,297  to $ 137,334

# Create the bar chart
top_agencies_plot <- ggplot(top_agencies, aes(x = reorder(`Agency Name`, avg_salary), y = avg_salary)) +
  geom_bar(
    stat = "identity", 
    fill = "#2ecc71", 
    alpha = 0.8,
    width = 0.7  # Bar width
  ) +
  coord_flip() +  # Flip coordinates for horizontal bars
  labs(
    title = "Top 10 Agencies by Average Base Salary",
    x = "Agency Name",
    y = "Average Base Salary ($)",
    subtitle = "Agencies with highest average base salaries"
  ) +
  geom_text(
    aes(label = paste0("$", format(round(avg_salary, 0), big.mark = ","))),
    hjust = -0.1,  # Position text to the right of bars
    size = 3.5,
    color = "black"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on top for labels
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.minor.y = element_blank()
  )

Visualization:

# Display the plot
top_agencies_plot

Figure 3: Top 10 Agencies by Average Base Salary

# Add annotation for highest paying agency
top_agencies_plot +
  annotate(
    "text",
    x = 10,  # Position at top
    y = highest_agency_salary * 0.8,
    label = paste("Highest: ", highest_agency_name, "\n$", format(round(highest_agency_salary, 0), big.mark = ",")),
    color = "darkgreen",
    size = 4,
    hjust = 0,
    fontface = "bold"
  )

Figure 3: Top 10 Agencies by Average Base Salary

Interpretation: - financial info svcs agency is the highest-paying agency with an average salary of $137,334 - All top 10 agencies have average salaries above $80,000 - Horizontal bar chart allows for easy comparison of agency salaries
Analysis: - Agency is one of the strongest features of wage prediction. However, due to the large number and different nature of institutions in New York City, direct insertion into the model may lead to a dimensional disaster. It is recommended that when dealing with this feature, the regression group should consider classifying institutions according to the nature of the business (such as legal, administrative, public security), or combining small institutions to optimize the model structure.

4.2 Agency Size vs Salary Relationship

Code Implementation:

# Calculate agency statistics for scatter plot
agency_stats <- df_cleaned %>%
  group_by(`Agency Name`) %>%
  summarise(
    avg_salary = mean(`Base Salary`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  ) %>%
  filter(employee_count >= 10)  # Filter out very small agencies

# Print summary of agency sizes
cat("Agency Size Statistics:\n")

## Agency Size Statistics:

agency_size_summary <- agency_stats %>%
  summarise(
    min_employees = min(employee_count),
    median_employees = median(employee_count),
    mean_employees = mean(employee_count),
    max_employees = max(employee_count),
    total_agencies = n()
  )
print(agency_size_summary)

## # A tibble: 1 × 5
##   min_employees median_employees mean_employees max_employees total_agencies
##           <int>            <int>          <dbl>         <int>          <int>
## 1            12               55           164.          1321             39

# Create the scatter plot
agency_scatter <- ggplot(agency_stats, 
       aes(x = employee_count, 
           y = avg_salary, 
           size = total_payroll,
           color = avg_salary)) +
  geom_point(
    alpha = 0.6,  # Point transparency
    shape = 19    # Circle shape
  ) +
  scale_color_gradient(
    low = "#f1c40f",  # Yellow for low salaries
    high = "#e74c3c", # Red for high salaries
    name = "Avg Salary"
  ) +
  scale_size_continuous(
    range = c(2, 10),  # Size range for points
    labels = scales::dollar,
    name = "Total Payroll"
  ) +
  labs(
    title = "Relationship Between Agency Size and Average Salary",
    x = "Number of Employees (log scale)",
    y = "Average Base Salary ($)",
    subtitle = "Point size represents total payroll, color represents average salary"
  ) +
  scale_x_log10(
    labels = scales::comma  # Format with commas
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(0, max(agency_stats$avg_salary) * 1.1)  # Add 10% padding
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    legend.position = "right",
    legend.box = "vertical"
  )

Visualization:

# Display the scatter plot
agency_scatter

$Figure 4: Relationship Between Agency Size and Average Salary$

Figure 4: Relationship Between Agency Size and Average Salary

# Add trend line
agency_scatter +
  geom_smooth(
    method = "lm",      # Linear regression
    formula = y ~ log(x),
    color = "blue",
    linetype = "dashed",
    se = FALSE,         # Don't show confidence interval
    size = 1
  ) +
  annotate(
    "text",
    x = median(agency_stats$employee_count),
    y = max(agency_stats$avg_salary) * 0.9,
    label = "Trend: Larger agencies tend to have\nlower average salaries",
    color = "blue",
    size = 4,
    hjust = 0.5
  )

Figure 4: Relationship Between Agency Size and Average Salary

Interpretation: - Larger agencies (1,000+ employees) tend to have lower average salaries - Smaller agencies show greater variability in average salaries - Blue dashed line shows the logarithmic trend - Point size represents total payroll (larger points = larger payroll) - Point color represents average salary (redder = higher salary)
Analysis: - The scatter chart reveals that there is a nonlinear inverse relationship between the size of the institution and the average wage, so it cannot be simply assumed that “the larger the institution, the higher the salary”. In the regression model, a binary feature (such as “whether it is a large institution”) can be constructed, or combined with the characteristics of the institution name to capture this structural salary difference.

5. Salary Composition by Leave Status

Code Implementation:

# Create salary composition data
salary_by_status <- df_cleaned %>%
  group_by(`Leave Status as of June 30`) %>%
  summarise(
    avg_base = mean(`Base Salary`, na.rm = TRUE),
    avg_ot = mean(`Total OT Paid`, na.rm = TRUE),
    avg_other = mean(`Total Other Pay`, na.rm = TRUE),
    count = n()
  ) %>%
  filter(count >= 50)  # Filter out statuses with few employees

# Print salary composition data
cat("Salary Composition by Leave Status:\n")

## Salary Composition by Leave Status:

print(salary_by_status)

## # A tibble: 4 × 5
##   `Leave Status as of June 30` avg_base   avg_ot avg_other count
##   <chr>                           <dbl>    <dbl>     <dbl> <int>
## 1 active                        41565.  3630.        2327.  5500
## 2 ceased                        41809.   831.         587.   871
## 3 on leave                      46246.  1058.        1149.    67
## 4 seasonal                         66.1    0.164      112.    70

# Calculate percentage composition
salary_composition_percent <- salary_by_status %>%
  mutate(
    total = avg_base + avg_ot + avg_other,
    base_pct = avg_base / total * 100,
    ot_pct = avg_ot / total * 100,
    other_pct = avg_other / total * 100
  )

cat("\nPercentage Composition:\n")

## 
## Percentage Composition:

print(salary_composition_percent %>% 
        select(`Leave Status as of June 30`, base_pct, ot_pct, other_pct))

## # A tibble: 4 × 4
##   `Leave Status as of June 30` base_pct ot_pct other_pct
##   <chr>                           <dbl>  <dbl>     <dbl>
## 1 active                           87.5 7.64        4.90
## 2 ceased                           96.7 1.92        1.36
## 3 on leave                         95.4 2.18        2.37
## 4 seasonal                         37.2 0.0923     62.7

# Store active employee percentage for later use
active_base_pct <- salary_composition_percent %>%
  filter(`Leave Status as of June 30` == "Active") %>%
  pull(base_pct) %>%
  round(0)

# Transform data for stacked bar chart
salary_long <- salary_by_status %>%
  pivot_longer(
    cols = c(avg_base, avg_ot, avg_other),
    names_to = "pay_type",
    values_to = "amount"
  ) %>%
  mutate(
    pay_type = factor(
      pay_type,
      levels = c("avg_base", "avg_ot", "avg_other"),
      labels = c("Base Salary", "Overtime Pay", "Other Pay")
    )
  )

# Create the stacked bar chart
composition_plot <- ggplot(salary_long, 
       aes(x = `Leave Status as of June 30`, 
           y = amount, 
           fill = pay_type)) +
  geom_bar(
    stat = "identity", 
    position = "stack",
    width = 0.7  # Bar width
  ) +
  scale_fill_brewer(
    palette = "Set2", 
    name = "Pay Type",
    labels = c("Base Salary", "Overtime", "Other Pay")
  ) +
  labs(
    title = "Salary Composition by Leave Status",
    x = "Leave Status (as of June 30)",
    y = "Average Amount ($)",
    subtitle = "Breakdown of compensation by pay type"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    expand = expansion(mult = c(0, 0.1))  # Add space at top
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(face = "bold")
  )

Visualization:

# Display the stacked bar chart
composition_plot

Figure 5: Salary Composition by Leave Status

# Add percentage labels for active employees
active_data <- salary_long %>%
  filter(`Leave Status as of June 30` == "Active" & pay_type == "Base Salary")

if (nrow(active_data) > 0) {
  composition_plot +
    annotate(
      "text",
      x = which(unique(salary_long$`Leave Status as of June 30`) == "Active"),
      y = active_data$amount * 0.5,
      label = paste0(active_base_pct, "%"),
      color = "white",
      size = 5,
      fontface = "bold"
    )
} else {
  composition_plot
}

Figure 5: Salary Composition by Leave Status

Interpretation: - On Leave/Ceased employees: Primarily earn through base salary - Active employees: Show a higher proportion of Overtime and Other Pay, likely due to special allowances or benefits

Analysis: - Stacked bars show how compensation structure changes with employment status This discovery directly serves the task of classification prediction. Data shows that “the proportion of other income” and “the amount of overtime pay” are strong signals to judge whether employees leave. It is recommended that the classification group carry out feature engineering and calculate “the ratio of other income to base wages” as the core predictive variable, which is likely to greatly improve the accuracy of the classification model (such as logical regression or decision tree).

6. Time Trend Analysis

Code Implementation:

# Calculate yearly statistics
yearly_stats <- df_cleaned %>%
  group_by(`Fiscal Year`) %>%
  summarise(
    avg_base = mean(`Base Salary`, na.rm = TRUE),
    avg_regular = mean(`Regular Gross Paid`, na.rm = TRUE),
    employee_count = n(),
    total_payroll = sum(`Regular Gross Paid`, na.rm = TRUE)
  )

# Print yearly statistics
cat("Yearly Salary Statistics:\n")

## Yearly Salary Statistics:

print(yearly_stats)

## # A tibble: 3 × 5
##   `Fiscal Year` avg_base avg_regular employee_count total_payroll
##           <dbl>    <dbl>       <dbl>          <int>         <dbl>
## 1          2015   38169.      36691.           2240     82186988.
## 2          2016   42807.      40706.           2129     86663331.
## 3          2017   42885.      42360.           2149     91030619.

# Calculate percentage changes
yearly_stats <- yearly_stats %>%
  mutate(
    base_pct_change = (avg_base - lag(avg_base)) / lag(avg_base) * 100,
    regular_pct_change = (avg_regular - lag(avg_regular)) / lag(avg_regular) * 100
  )

cat("\nPercentage Changes from Previous Year:\n")

## 
## Percentage Changes from Previous Year:

print(yearly_stats %>% select(`Fiscal Year`, base_pct_change, regular_pct_change))

## # A tibble: 3 × 3
##   `Fiscal Year` base_pct_change regular_pct_change
##           <dbl>           <dbl>              <dbl>
## 1          2015          NA                  NA   
## 2          2016          12.2                10.9 
## 3          2017           0.182               4.06

# Store key values for later use
start_year <- yearly_stats$`Fiscal Year`[1]
end_year <- yearly_stats$`Fiscal Year`[3]
start_salary <- yearly_stats$avg_base[1]
end_salary <- yearly_stats$avg_base[3]
total_pct_change <- round(((end_salary - start_salary) / start_salary) * 100, 1)

# Transform to long format for plotting
yearly_long <- yearly_stats %>%
  select(`Fiscal Year`, avg_base, avg_regular) %>%
  pivot_longer(
    cols = -`Fiscal Year`,
    names_to = "salary_type",
    values_to = "amount"
  ) %>%
  mutate(
    salary_type = factor(
      salary_type,
      levels = c("avg_base", "avg_regular"),
      labels = c("Base Salary", "Regular Gross Paid")
    )
  )

# Create the line chart
trend_plot <- ggplot(yearly_long, 
       aes(x = as.factor(`Fiscal Year`), 
           y = amount, 
           group = salary_type, 
           color = salary_type)) +
  geom_line(
    size = 1.2,        # Line thickness
    linetype = "solid" # Line style
  ) +
  geom_point(
    size = 3,          # Point size
    shape = 19         # Circle shape
  ) +
  scale_color_manual(
    values = c("Base Salary" = "#3498db", "Regular Gross Paid" = "#2ecc71"),
    name = "Salary Type"
  ) +
  labs(
    title = "Salary Trends by Fiscal Year",
    x = "Fiscal Year",
    y = "Average Amount ($)",
    subtitle = "Comparison of base salary vs regular gross pay over time"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(min(yearly_long$amount) * 0.95, max(yearly_long$amount) * 1.05)
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank()  # Remove vertical grid lines
  )

Visualization:

# Display the trend plot
trend_plot

Figure 6: Salary Trends by Fiscal Year

# Add percentage change annotations
for (i in 2:nrow(yearly_stats)) {
  # Calculate position for base salary annotation
  base_y <- yearly_stats$avg_base[i]
  base_change <- yearly_stats$base_pct_change[i]
  
  # Add annotation for significant changes
  if (!is.na(base_change) && abs(base_change) > 1) {
    trend_plot <- trend_plot +
      annotate(
        "text",
        x = i,
        y = base_y * 1.02,
        label = paste0(ifelse(base_change > 0, "+", ""), round(base_change, 1), "%"),
        color = "#3498db",
        size = 3.5,
        fontface = "bold"
      )
  }
}
trend_plot

Figure 6: Salary Trends by Fiscal Year

Interpretation: - 2015 to 2017: Average base salary increased by 12.2% (from $38,169 to $42,885) - Regular gross pay shows a similar upward trend - Blue line: Base salary trend - Green line: Regular gross pay trend - Percentage labels: Show year-over-year changes
Analysis: - The time factor cannot be ignored in salary forecasting. If the “Fiscal Year” is not included in the model, the model will not be able to explain the natural wage growth over time, resulting in low forecasts of recent data.

7. Comprehensive Dashboard

Code Implementation:

# Create multiple subplots for dashboard

# Plot 1: Histogram with density overlay
p1 <- ggplot(df_cleaned, aes(x = `Base Salary`)) +
  geom_histogram(
    aes(y = ..density..),  # Use density instead of count
    bins = 30, 
    fill = "#3498db", 
    alpha = 0.7,
    color = "white"
  ) +
  geom_density(
    color = "#2c3e50", 
    size = 1, 
    alpha = 0.5
  ) +
  labs(
    title = "Base Salary Distribution",
    x = "Base Salary ($)", 
    y = "Density"
  ) +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

# Plot 2: Boxplot by borough (simplified)
p2 <- ggplot(df_cleaned, aes(x = `Work Location Borough`, y = `Base Salary`)) +
  geom_boxplot(
    fill = "#e74c3c", 
    alpha = 0.7,
    outlier.size = 0.8
  ) +
  labs(
    title = "Salary by Borough", 
    x = "Borough", 
    y = "Base Salary ($)"
  ) +
  scale_y_continuous(
    labels = scales::dollar,
    limits = c(0, quantile(df_cleaned$`Base Salary`, 0.95, na.rm = TRUE))  # Remove extreme outliers
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.title = element_text(size = 9)
  )

# Plot 3: Employees by leave status
p3_data <- df_cleaned %>%
  count(`Leave Status as of June 30`) %>%
  mutate(percentage = n / sum(n) * 100)

p3 <- ggplot(p3_data, aes(x = reorder(`Leave Status as of June 30`, n), y = n)) +
  geom_bar(
    stat = "identity", 
    fill = "#2ecc71", 
    alpha = 0.7,
    width = 0.6
  ) +
  coord_flip() +
  geom_text(
    aes(label = paste0(round(percentage, 1), "%")),
    hjust = -0.1, 
    size = 3
  ) +
  labs(
    title = "Employees by Leave Status", 
    x = "Leave Status", 
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

# Plot 4: Employees by pay basis
p4_data <- df_cleaned %>%
  count(`Pay Basis`) %>%
  mutate(percentage = n / sum(n) * 100)

p4 <- ggplot(p4_data, aes(x = reorder(`Pay Basis`, n), y = n)) +
  geom_bar(
    stat = "identity", 
    fill = "#9b59b6", 
    alpha = 0.7,
    width = 0.6
  ) +
  coord_flip() +
  geom_text(
    aes(label = paste0(round(percentage, 1), "%")),
    hjust = -0.1, 
    size = 3
  ) +
  labs(
    title = "Employees by Pay Basis", 
    x = "Pay Basis", 
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 9)
  )

Visualization:

# Combine charts using patchwork
(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "NYC Payroll Data Dashboard",
    subtitle = "Key Insights at a Glance - Interactive View of Multiple Dimensions",
    caption = "Source: NYC Citywide Payroll Data 2015-2017",
    theme = theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
      plot.subtitle = element_text(size = 12, color = "gray50", hjust = 0.5),
      plot.caption = element_text(size = 10, color = "gray60", hjust = 1)
    )
  )

Figure 7: NYC Payroll Data Dashboard

Dashboard Components: 1. Top-left: Base salary distribution with density overlay 2. Top-right: Salary distribution by borough (boxplots) 3. Bottom-left: Employee count by leave status with percentages 4. Bottom-right: Employee count by pay basis with percentages

8. Key Findings and Recommendations

8.1 Major Findings

Salary Distribution: Base salaries show a right-skewed distribution with a median of $39,370.
Agency Differences: The highest-paying agency (financial info svcs agency) pays $137,334 on average.
Geographic Differences: Employees in other have the highest median salary.
Time Trends: From 2015 to 2017, average salaries increased by 12.2%.
Salary Composition: For most active employees, base salary constitutes % of total compensation.

8.2 Recommendations

Further Investigation: Examine why smaller agencies show higher average salaries and whether this is due to seniority or role specialization.
Focus Areas: Consider equity reviews for borough-based salary disparities.
Improvement Measures: Implement regular payroll audits to ensure overtime and other pay components are distributed fairly across leave statuses.

9. Appendix

9.1 Complete Summary Statistics

# Create comprehensive summary statistics
complete_summary <- data.frame(
  Metric = c(
    "Total Employees (Records)",
    "Unique Employees (by ID)",
    "Number of Agencies",
    "Work Locations (Boroughs)",
    "Fiscal Year Range",
    "Average Base Salary",
    "Median Base Salary",
    "Maximum Base Salary",
    "Minimum Base Salary",
    "Base Salary Standard Deviation",
    "Total Payroll (Regular Gross Paid)",
    "Average Regular Gross Paid",
    "Most Common Leave Status",
    "Most Common Pay Basis"
  ),
  Value = c(
    format(nrow(df_cleaned), big.mark = ","),
    format(length(unique(df_cleaned$employee_id)), big.mark = ","),
    length(unique(df_cleaned$`Agency Name`)),
    length(unique(df_cleaned$`Work Location Borough`)),
    paste(range(df_cleaned$`Fiscal Year`, na.rm = TRUE), collapse = " to "),
    paste0("$", format(round(mean(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(median(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(max(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(min(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(sd(df_cleaned$`Base Salary`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(sum(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
    paste0("$", format(round(mean(df_cleaned$`Regular Gross Paid`, na.rm = TRUE), 0), big.mark = ",")),
    names(sort(table(df_cleaned$`Leave Status as of June 30`), decreasing = TRUE))[1],
    names(sort(table(df_cleaned$`Pay Basis`), decreasing = TRUE))[1]
  )
)

# Display as formatted table
kable(complete_summary, 
      caption = "Complete Summary Statistics of NYC Payroll Data",
      align = c("l", "r"),
      col.names = c("Metric", "Value"))

Complete Summary Statistics of NYC Payroll Data
Metric	Value
Total Employees (Records)	6,518
Unique Employees (by ID)	2,500
Number of Agencies	66
Work Locations (Boroughs)	10
Fiscal Year Range	2015 to 2017
Average Base Salary	$41,239
Median Base Salary	$39,370
Maximum Base Salary	$190,665
Minimum Base Salary	$1
Base Salary Standard Deviation	$39,444
Total Payroll (Regular Gross Paid)	$259,880,938
Average Regular Gross Paid	$39,871
Most Common Leave Status	active
Most Common Pay Basis	per annum

9.2 R Session Information

# Display full session information
session_info <- sessionInfo()

cat("R Version:", session_info$R.version$version.string, "\n")

## R Version: R version 4.5.1 (2025-06-13 ucrt)

cat("Platform:", session_info$R.version$platform, "\n")

## Platform: x86_64-w64-mingw32

cat("Operating System:", session_info$running, "\n\n")

## Operating System: Windows 11 x64 (build 26200)

cat("Loaded Packages:\n")

## Loaded Packages:

packages <- data.frame(
  Package = names(session_info$otherPkgs),
  Version = sapply(session_info$otherPkgs, function(x) x$Version)
)
print(packages)

##                   Package  Version
## RColorBrewer RColorBrewer    1.1-3
## glmnet             glmnet   4.1-10
## Matrix             Matrix    1.7-3
## ranger             ranger   0.17.0
## pROC                 pROC 1.19.0.1
## randomForest randomForest  4.7-1.2
## nnet                 nnet   7.3-20
## caret               caret    7.0-1
## lattice           lattice   0.22-7
## kableExtra     kableExtra    1.4.0
## knitr               knitr     1.51
## scales             scales    1.4.0
## patchwork       patchwork    1.3.2
## lubridate       lubridate    1.9.4
## forcats           forcats    1.0.1
## stringr           stringr    1.5.2
## dplyr               dplyr    1.1.4
## purrr               purrr    1.2.0
## readr               readr    2.1.6
## tidyr               tidyr    1.3.2
## tibble             tibble    3.3.0
## ggplot2           ggplot2    4.0.1
## tidyverse       tidyverse    2.0.0

9.3 Data Quality Notes

# Check for missing values
missing_values <- colSums(is.na(df_cleaned))
missing_df <- data.frame(
  Column = names(missing_values),
  Missing_Values = missing_values,
  Percentage = round(missing_values / nrow(df_cleaned) * 100, 2)
)

cat("Missing Values Analysis:\n")

## Missing Values Analysis:

print(missing_df[missing_df$Missing_Values > 0, ])

## [1] Column         Missing_Values Percentage    
## <0 rows> (or 0-length row.names)

# Check for duplicates
duplicates <- sum(duplicated(df_cleaned))
cat("\nNumber of duplicate rows:", duplicates, "\n")

## 
## Number of duplicate rows: 0

# Data range validation
cat("\nData Validation:\n")

## 
## Data Validation:

cat("- Fiscal Year range:", min(df_cleaned$`Fiscal Year`, na.rm = TRUE), "to", 
    max(df_cleaned$`Fiscal Year`, na.rm = TRUE), "\n")

## - Fiscal Year range: 2015 to 2017

cat("- Base Salary range: $", min(df_cleaned$`Base Salary`, na.rm = TRUE), "to $", 
    max(df_cleaned$`Base Salary`, na.rm = TRUE), "\n")

## - Base Salary range: $ 1 to $ 190665.2

cat("- Data covers", length(unique(df_cleaned$`Fiscal Year`)), "fiscal years\n")

## - Data covers 3 fiscal years

Report Generated: 2026-01-15 10:17:29
Data Source: New York City - Citywide Payroll Data (Fiscal Years 2015-2017)
Analysis Team: Data Visualization Group
Report Version: 1.0 - With Full Code and Visualizations

How to Use This Report

For Data Analysts:

Code Reusability: All R code is included and can be reused for similar analyses
Customization: Modify parameters (colors, bin sizes, etc.) in the code blocks
Reproducibility: Run the entire report from start to finish

For Decision Makers:

Key Insights: Focus on the interpretation sections below each visualization
Dashboard View: Use Figure 7 for a quick overview of all key metrics
Summary Statistics: Refer to Section 8.1 for key numbers

For Technical Reviewers:

Code Quality: All code includes comments explaining each step
Data Validation: Section 9.3 shows data quality checks
Reproducibility: Session information ensures environment consistency

Analysis: Classification

This classification analysis aims to predict employee leave status (Active, On Leave, Ceased, On Separation Leave) using payroll, job, and organisational attributes from the NYC Citywide Payroll dataset.

The rationale for this analysis is to examine whether observable employment and compensation patterns can meaningfully distinguish different leave outcomes, which reflects common workforce analytics use cases such as monitoring employment stability, identifying departments with higher exit risk, and understanding structural differences in employment status.

By framing leave status as a supervised classification problem, the analysis also allows comparison of different modelling approaches and highlights the impact of class imbalance and data quality on predictive performance.

Loading Data

df_analysis <- read_csv("cleaned_payroll_data.csv")
# Set random seed to reproduce result in every iteration
set.seed(123)

Additional Data Processing for Classification Analysis

# Convert the 'Leave Status as of June 30' column into a categorical factor variable.
# The factor levels are explicitly specified to:
# 1) Standardise the leave status categories,
# 2) Ensure consistent ordering of classes for analysis and modelling,
# 3) Prevent R from automatically inferring or reordering levels.
df_analysis$LeaveStatus <- factor(
  df_analysis$`Leave Status as of June 30`,
  levels = c("active", "on leave", "ceased", "on separation leave")
)

# Standardise column names by replacing spaces with underscores.
df_analysis <- df_analysis %>%
  rename_with(~ gsub(" ", "_", .x))

head(df_analysis)

## # A tibble: 6 × 18
##   Fiscal_Year Agency_Name        Last_Name First_Name Mid_Init Agency_Start_Date
##         <dbl> <chr>              <chr>     <chr>      <chr>    <date>           
## 1        2016 district attorney… concannon connor     m        2012-01-04       
## 2        2016 district attorney… kim       su kyoung  unknown  2012-07-23       
## 3        2016 police department  adme      nicot      unknown  2005-01-10       
## 4        2016 district attorney… walen     williams   w        2013-09-09       
## 5        2016 district attorney… yuldashev anvar      unknown  2007-10-29       
## 6        2016 district attorney… unknown   unknown    unknown  2016-02-03       
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## #   Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## #   Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## #   LeaveStatus <fct>

# Create a modelling dataset by selecting only variables relevant to the classification task (Leave Status).
model_data <- df_analysis %>%
  select(
    LeaveStatus,
    Work_Location_Borough,
    Pay_Basis,
    Base_Salary,
    Regular_Hours,
    Regular_Gross_Paid,
    OT_Hours,
    Total_OT_Paid,
    Total_Other_Pay
  )

# Convert selected categorical predictors into factor variables.
model_data <- model_data %>%
  mutate(
    across(
      c(`Work_Location_Borough`, `Pay_Basis`),
      as.factor
    )
  )
  
# Remove rows with missing target variable
model_data <- model_data %>%
  filter(!is.na(LeaveStatus)) %>%
  droplevels()

Train-test Split

# Split the dataset into training and testing sets using stratified sampling.
train_index <- createDataPartition(
  model_data$LeaveStatus,
  p = 0.8,
  list = FALSE
)

train_data <- model_data[train_index, ]
test_data  <- model_data[-train_index, ]

Model Training and Evaluation

# Fit a multinomial logistic regression model to predict Leave Status.
multinom_model <- multinom(
  LeaveStatus ~ .,
  data = train_data,
  trace = FALSE
)

# Generate predicted leave status classes
multinom_pred <- predict(multinom_model, test_data)

# Evaluate model performance by comparing predicted classes
confusionMatrix(
  multinom_pred,
  test_data$LeaveStatus
)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            active on leave ceased on separation leave
##   active                1082       10     80                   2
##   on leave                 0        0      0                   0
##   ceased                  18        3     94                   0
##   on separation leave      0        0      0                   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9123          
##                  95% CI : (0.8956, 0.9272)
##     No Information Rate : 0.8534          
##     P-Value [Acc > NIR] : 1.159e-10       
##                                           
##                   Kappa : 0.584           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: active Class: on leave Class: ceased
## Sensitivity                 0.9836         0.00000       0.54023
## Specificity                 0.5132         1.00000       0.98117
## Pos Pred Value              0.9216             NaN       0.81739
## Neg Pred Value              0.8435         0.98991       0.93186
## Prevalence                  0.8534         0.01009       0.13499
## Detection Rate              0.8394         0.00000       0.07292
## Detection Prevalence        0.9108         0.00000       0.08922
## Balanced Accuracy           0.7484         0.50000       0.76070
##                      Class: on separation leave
## Sensitivity                            0.000000
## Specificity                            1.000000
## Pos Pred Value                              NaN
## Neg Pred Value                         0.998448
## Prevalence                             0.001552
## Detection Rate                         0.000000
## Detection Prevalence                   0.000000
## Balanced Accuracy                      0.500000

# Train a Random Forest classification model to predict Leave Status.
rf_model <- ranger(
  LeaveStatus ~ .,
  data = train_data,
  num.trees = 300,
  importance = "impurity",
  probability = FALSE
)

# Generate predicted class labels for the test dataset.
rf_pred <- predict(rf_model, test_data)$predictions

# Evaluate the Random Forest model by comparing predicted classes
confusionMatrix(
  as.factor(rf_pred),
  test_data$LeaveStatus
)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            active on leave ceased on separation leave
##   active                1086        9     57                   2
##   on leave                 0        1      0                   0
##   ceased                  14        3    117                   0
##   on separation leave      0        0      0                   0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9341         
##                  95% CI : (0.9191, 0.947)
##     No Information Rate : 0.8534         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7029         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: active Class: on leave Class: ceased
## Sensitivity                 0.9873       0.0769231       0.67241
## Specificity                 0.6402       1.0000000       0.98475
## Pos Pred Value              0.9411       1.0000000       0.87313
## Neg Pred Value              0.8963       0.9906832       0.95065
## Prevalence                  0.8534       0.0100853       0.13499
## Detection Rate              0.8425       0.0007758       0.09077
## Detection Prevalence        0.8953       0.0007758       0.10396
## Balanced Accuracy           0.8137       0.5384615       0.82858
##                      Class: on separation leave
## Sensitivity                            0.000000
## Specificity                            1.000000
## Pos Pred Value                              NaN
## Neg Pred Value                         0.998448
## Prevalence                             0.001552
## Detection Rate                         0.000000
## Detection Prevalence                   0.000000
## Balanced Accuracy                      0.500000

final_predictions <- test_data %>%
  mutate(
    Predicted_LeaveStatus = rf_pred
  )

head(final_predictions)

## # A tibble: 6 × 10
##   LeaveStatus Work_Location_Borough Pay_Basis Base_Salary Regular_Hours
##   <fct>       <fct>                 <fct>           <dbl>         <dbl>
## 1 active      manhattan             per annum       41402         1830 
## 2 active      brooklyn              per annum       50888          853.
## 3 active      queens                per annum       42305         1830 
## 4 active      bronx                 per annum       26536         1830 
## 5 active      bronx                 per annum       42819         2091.
## 6 active      brooklyn              per annum       49760         2091.
## # ℹ 5 more variables: Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, Predicted_LeaveStatus <fct>

Both models demonstrate strong overall performance in predicting employee leave status. The multinomial logistic regression achieves an accuracy of 91%, while the random forest model improves this slightly to 94%, indicating a modest but consistent gain from using a non-linear, tree-based approach.

Examination of the confusion matrices shows that both models classify the “active” and “ceased” categories accurately, with very few misclassifications. However, neither model successfully predicts the rare classes “on leave” and “on separation leave”, which are almost never identified due to their extremely low prevalence in the dataset.

Overall, the random forest model exhibits marginally better classification performance, particularly in reducing misclassification between the active and ceased categories, but both models are heavily influenced by class imbalance when predicting less frequent leave statuses. ***

Analysis: Regression

Objective of Regression Analysis

The objective of this regression analysis is to predict the annual income of New York City employees based on their job-related and employment characteristics. By building a regression model, this analysis aims to quantify how factors such as job title, base salary, agency affiliation, work hours, and tenure contribute to variations in employees’ annual income.

df <- df_analysis
dim(df)

## [1] 6518   18

head(df, 3)

## # A tibble: 3 × 18
##   Fiscal_Year Agency_Name        Last_Name First_Name Mid_Init Agency_Start_Date
##         <dbl> <chr>              <chr>     <chr>      <chr>    <date>           
## 1        2016 district attorney… concannon connor     m        2012-01-04       
## 2        2016 district attorney… kim       su kyoung  unknown  2012-07-23       
## 3        2016 police department  adme      nicot      unknown  2005-01-10       
## # ℹ 12 more variables: Work_Location_Borough <chr>, Title_Description <chr>,
## #   Leave_Status_as_of_June_30 <chr>, Base_Salary <dbl>, Pay_Basis <chr>,
## #   Regular_Hours <dbl>, Regular_Gross_Paid <dbl>, OT_Hours <dbl>,
## #   Total_OT_Paid <dbl>, Total_Other_Pay <dbl>, employee_id <chr>,
## #   LeaveStatus <fct>

str(df)

## spc_tbl_ [6,518 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Fiscal_Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency_Name               : chr [1:6518] "district attorney-manhattan" "district attorney-manhattan" "police department" "district attorney-manhattan" ...
##  $ Last_Name                 : chr [1:6518] "concannon" "kim" "adme" "walen" ...
##  $ First_Name                : chr [1:6518] "connor" "su kyoung" "nicot" "williams" ...
##  $ Mid_Init                  : chr [1:6518] "m" "unknown" "unknown" "w" ...
##  $ Agency_Start_Date         : Date[1:6518], format: "2012-01-04" "2012-07-23" ...
##  $ Work_Location_Borough     : chr [1:6518] "manhattan" "manhattan" "manhattan" "manhattan" ...
##  $ Title_Description         : chr [1:6518] "community coordinator" "community coordinator" "sergeant" "community associate" ...
##  $ Leave_Status_as_of_June_30: chr [1:6518] "active" "active" "active" "active" ...
##  $ Base_Salary               : num [1:6518] 82000 84694 103585 41402 269 ...
##  $ Pay_Basis                 : chr [1:6518] "per annum" "per annum" "per annum" "per annum" ...
##  $ Regular_Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ Regular_Gross_Paid        : num [1:6518] 80311 83784 90640 40957 74665 ...
##  $ OT_Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ Total_OT_Paid             : num [1:6518] 5739 0 22382 667 14904 ...
##  $ Total_Other_Pay           : num [1:6518] 502.9 0 13749.7 10.8 9107 ...
##  $ employee_id               : chr [1:6518] "CONCANNONCONNOR01/04/2012" "KIMSU KYOUNG07/23/2012" "ADMENICOT01/10/2005" "WALENWILLIAMS09/09/2013" ...
##  $ LeaveStatus               : Factor w/ 4 levels "active","on leave",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Fiscal Year` = col_double(),
##   ..   `Agency Name` = col_character(),
##   ..   `Last Name` = col_character(),
##   ..   `First Name` = col_character(),
##   ..   `Mid Init` = col_character(),
##   ..   `Agency Start Date` = col_date(format = ""),
##   ..   `Work Location Borough` = col_character(),
##   ..   `Title Description` = col_character(),
##   ..   `Leave Status as of June 30` = col_character(),
##   ..   `Base Salary` = col_double(),
##   ..   `Pay Basis` = col_character(),
##   ..   `Regular Hours` = col_double(),
##   ..   `Regular Gross Paid` = col_double(),
##   ..   `OT Hours` = col_double(),
##   ..   `Total OT Paid` = col_double(),
##   ..   `Total Other Pay` = col_double(),
##   ..   employee_id = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Target Variable Construction

In this study, the target variable is defined as annual income, representing the total compensation an employee receives within a fiscal year. Annual income is calculated as the sum of three components: Regular Gross Paid, Total OT Paid, and Total Other Pay.

Since these components are used to construct the target variable, they are excluded from the set of independent variables in the regression model to prevent data leakage and ensure a fair evaluation of model performance.

# Construct target variable: Annual Income，
#Define the target variable (annual income) as the sum of all income components
df$annual_income <- df$Regular_Gross_Paid + df$Total_OT_Paid + df$Total_Other_Pay
# Examine the distribution of the target variable
summary(df$annual_income)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.78   3629.09  36057.18  45140.73  77025.30 351840.91

# Convert agency start date to Date format
df$Agency_Start_Date <- as.Date(df$Agency_Start_Date)

# Use fiscal year end (June 30) as the reference point
fy_end <- as.Date(paste0(df$Fiscal_Year, "-06-30"))
## Calculate employee tenure in years
df$tenure_years <- as.numeric(fy_end - df$Agency_Start_Date) / 365.25
# Inspect tenure distribution
summary(df$tenure_years)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1123  2.9979  9.6947 11.6263 17.8234 50.8036

# Define target variable
y <- df$annual_income

Feature Selection

The selection of explanatory variables is based on the cleaned payroll dataset and includes attributes that are expected to influence employee income. These variables include fiscal year, agency name, work location borough, job title, leave status, base salary, pay basis, regular working hours, and overtime hours.

In addition, an engineered feature, employee tenure (tenure_years), is created using the employee’s agency start date and the fiscal year end date. This variable captures the length of service and reflects an employee’s career stage within the organization.

To facilitate regression modeling, all categorical variables are converted into factor variables, allowing them to be automatically encoded as dummy variables during model fitting.

# Specify columns to exclude
# - Target variable
# - Income components (to avoid data leakage)
# - ID and name-related variables
# - Raw start date (already transformed into tenure_years)
drop_cols <- c(
  "annual_income",
  "Regular_Gross_Paid", "Total_OT_Paid", "Total_Other_Pay",
  "First_Name", "Last_Name", "Mid_Init",
  "Agency_Start_Date",
  "employee_id"
)
# Select predictor variables
keep_cols <- setdiff(names(df), drop_cols)
X <- df[, keep_cols]

# Convert character variables to factors for regression modeling
char_cols <- sapply(X, is.character)
X[char_cols] <- lapply(X[char_cols], as.factor)
# Check structure of predictors
str(X)

## tibble [6,518 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Fiscal_Year               : num [1:6518] 2016 2016 2016 2016 2016 ...
##  $ Agency_Name               : Factor w/ 66 levels "admin for children's svcs",..: 41 41 64 41 41 43 43 43 43 44 ...
##  $ Work_Location_Borough     : Factor w/ 10 levels "bronx","brooklyn",..: 5 5 5 5 5 2 2 2 2 7 ...
##  $ Title_Description         : Factor w/ 342 levels "*housing caretaker",..: 106 106 274 105 179 248 42 105 42 105 ...
##  $ Leave_Status_as_of_June_30: Factor w/ 5 levels "active","ceased",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Base_Salary               : num [1:6518] 82000 84694 103585 41402 269 ...
##  $ Pay_Basis                 : Factor w/ 4 levels "per annum","per day",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ Regular_Hours             : num [1:6518] 1830 1830 2091 1830 2091 ...
##  $ OT_Hours                  : num [1:6518] 139 0 331 28 264 ...
##  $ LeaveStatus               : Factor w/ 4 levels "active","on leave",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ tenure_years              : num [1:6518] 4.49 3.94 11.47 2.81 8.67 ...

Train–Test Split and Baseline Model

The dataset is randomly split into a training set (80%) and a testing set (20%) to evaluate the model’s generalization performance.

As a benchmark, a baseline model is constructed by predicting the mean annual income of the training data for all observations in the test set. The baseline results indicate high prediction error and an R² value close to zero, demonstrating that this simple approach fails to explain variations in employee income and serves only as a reference for comparison.

set.seed(42)
# Total number of observations
n <- nrow(df)
# Randomly sample 80% of data for training
idx <- sample(seq_len(n), size = floor(0.8 * n))
# Create training and testing sets
X_train <- X[idx, ]
X_test  <- X[-idx, ]
y_train <- y[idx]
y_test  <- y[-idx]

# Define evaluation metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2, na.rm = TRUE))
mae  <- function(actual, pred) mean(abs(actual - pred), na.rm = TRUE)
r2   <- function(actual, pred) {
  ss_res <- sum((actual - pred)^2, na.rm = TRUE)
  ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
  1 - ss_res/ss_tot
}
# Baseline prediction using training mean
base_pred <- rep(mean(y_train, na.rm = TRUE), length(y_test))
# Evaluate baseline model
cat("Baseline:\n")

## Baseline:

cat("RMSE:", rmse(y_test, base_pred), "\n")

## RMSE: 44779.91

cat("MAE :", mae(y_test, base_pred), "\n")

## MAE : 37671.06

cat("R2  :", r2(y_test, base_pred), "\n")

## R2  : -0.002035194

Ridge Regression Model

To improve predictive performance and address potential multicollinearity among predictors, a Ridge regression model is employed. Given the right-skewed distribution of income, the target variable is log-transformed using log(1 + income) prior to model training.

The optimal regularization parameter is selected using 5-fold cross-validation. The Ridge regression model significantly outperforms the baseline model, achieving substantially lower RMSE and MAE values and a much higher R², indicating strong explanatory and predictive capability.

# Combine train and test data for consistent preprocessing
X_all <- rbind(X_train, X_test)

fill_missing <- function(d){
  for (nm in names(d)) {
    if (is.numeric(d[[nm]]) || is.integer(d[[nm]])) {
      med <- median(d[[nm]], na.rm = TRUE)
      d[[nm]][is.na(d[[nm]])] <- med
    } else {
      d[[nm]] <- as.factor(d[[nm]])
      d[[nm]] <- as.character(d[[nm]])
      d[[nm]][is.na(d[[nm]])] <- "Missing"
      d[[nm]] <- as.factor(d[[nm]])
    }
  }
  d
}
# Apply missing value handling
X_all2 <- fill_missing(X_all)
X_train2 <- X_all2[1:nrow(X_train), ]
X_test2  <- X_all2[(nrow(X_train)+1):nrow(X_all2), ]

# Create model matrices (one-hot encoding)
mm_train <- model.matrix(~ . - 1, data = X_train2)
mm_test  <- model.matrix(~ . - 1, data = X_test2)

# Perform cross-validation to select optimal lambda
set.seed(42)
cv_ridge <- cv.glmnet(
  x = mm_train,
  y = log1p(y_train),
  alpha = 0,         # Ridge
  nfolds = 5
)

best_lambda <- cv_ridge$lambda.min
best_lambda

## [1] 0.1455606

plot(cv_ridge)

# Train final Ridge regression model
ridge_fit <- glmnet(
  x = mm_train,
  y = log1p(y_train),
  alpha = 0,
  lambda = best_lambda
)
# Generate predictions and transform back to original scale
pred_log <- predict(ridge_fit, newx = mm_test)
pred <- expm1(as.numeric(pred_log))
# Evaluate Ridge regression model
cat("Ridge (log-target):\n")

## Ridge (log-target):

cat("RMSE:", rmse(y_test, pred), "\n")

## RMSE: 21480.57

cat("MAE :", mae(y_test, pred), "\n")

## MAE : 11182.69

cat("R2  :", r2(y_test, pred), "\n")

## R2  : 0.7694272

Model Interpretation: Key Predictors

To enhance interpretability, the coefficients of the Ridge regression model are examined. By ranking predictors based on the absolute magnitude of their coefficients, the most influential factors affecting annual income can be identified.

The results show that job title (Title Description) is the dominant predictor, reflecting the structured nature of compensation within the New York City government. Other important predictors include base salary, agency affiliation, and pay basis. The direction and magnitude of the coefficients are consistent with real-world salary structures, supporting the validity of the model.

# Extract model coefficients
coef_vec <- as.matrix(coef(ridge_fit))
coef_df <- data.frame(
  feature = rownames(coef_vec),
  coef = as.numeric(coef_vec)
)
# Remove intercept and compute absolute values
coef_df <- coef_df[coef_df$feature != "(Intercept)", ]
coef_df$abs_coef <- abs(coef_df$coef)
# Display top 20 most influential predictors
top20 <- coef_df[order(-coef_df$abs_coef), ][1:20, ]
top20

##                                                     feature      coef abs_coef
## 317         Title_Descriptionprogram specialist  correction -3.681384 3.681384
## 398           Title_Descriptiontaxi and limousine inspector -3.602185 3.602185
## 354              Title_Descriptionsign language interpreter -3.586407 3.586407
## 292                Title_Descriptionnon-teaching adjunct iv -3.560087 3.560087
## 413                      Title_Descriptionurban park ranger -3.148319 3.148319
## 291               Title_Descriptionnon-teaching adjunct iii -2.711848 2.711848
## 329                  Title_Descriptionrecreation supervisor -2.699143 2.699143
## 79             Title_Description?interpreter/translator doe -2.595647 2.595647
## 367                Title_Descriptionstudent aide vocational -2.580178 2.580178
## 290                Title_Descriptionnon-teaching adjunct ii -2.574478 2.574478
## 303                                Title_Descriptionplumber  2.551334 2.551334
## 288             Title_Descriptionnew york city urban fellow  2.532225 2.532225
## 233 Title_Descriptionemployee assistance program specialist  2.520360 2.520360
## 168                      Title_Descriptioncity service aide -2.292867 2.292867
## 366                           Title_Descriptionstudent aide -2.241918 2.241918
## 193                             Title_Descriptionconsultant  2.206524 2.206524
## 289                 Title_Descriptionnon-teaching adjunct i -2.163413 2.163413
## 297                                Title_Descriptionpainter  2.078504 2.078504
## 59                Agency_Nameoffice of emergency management -1.940255 1.940255
## 389              Title_Descriptionsupervisor ii social work  1.937469 1.937469

Error Analysis

An error analysis is conducted to investigate cases with the largest prediction errors. The analysis reveals that higher errors tend to occur in specific job roles or agencies where income structures are more complex or less consistent.

These discrepancies may result from irregular payments, partial-year employment, or special allowances that are not fully captured by the available predictors. Despite these limitations, the model demonstrates strong overall performance, and such errors are expected in real-world payroll data.

# Create error analysis dataset
res <- data.frame(
  actual = y_test,
  pred = pred
)
# Compute prediction errors
res$error <- res$pred - res$actual
res$abs_error <- abs(res$error)
# Inspect cases with the largest absolute errors
head(res[order(-res$abs_error), ], 10)

##         actual       pred     error abs_error
## 441  183173.67 459365.635  276192.0  276192.0
## 1261 177308.90   9392.144 -167916.8  167916.8
## 908   92781.56 256640.711  163859.2  163859.2
## 1232 127939.46 280603.517  152664.1  152664.1
## 1256 187684.19 311025.598  123341.4  123341.4
## 1200 116734.60 230601.152  113866.6  113866.6
## 1175  81245.28 193323.653  112078.4  112078.4
## 30   146271.13 256128.709  109857.6  109857.6
## 340  108196.75   2141.227 -106055.5  106055.5
## 939  114631.81  12030.306 -102601.5  102601.5

summary(res$abs_error)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.309e+00 1.419e+03 5.670e+03 1.118e+04 1.408e+04 2.762e+05

# Combine errors with original test features for interpretation
res2 <- cbind(res, X_test2)
head(res2[order(-res2$abs_error), c("actual","pred","abs_error","Title_Description","Agency_Name")], 10)

##         actual       pred abs_error                           Title_Description
## 441  183173.67 459365.635  276192.0         police officer d/a detective 1st gr
## 1261 177308.90   9392.144  167916.8 lieutenant d/a commander of detective squad
## 908   92781.56 256640.711  163859.2                       investigative manager
## 1232 127939.46 280603.517  152664.1               assistant corporation counsel
## 1256 187684.19 311025.598  123341.4                captain d/a deputy inspector
## 1200 116734.60 230601.152  113866.6  administrative director of social services
## 1175  81245.28 193323.653  112078.4                                 electrician
## 30   146271.13 256128.709  109857.6                                 electrician
## 340  108196.75   2141.227  106055.5                       community coordinator
## 939  114631.81  12030.306  102601.5                          correction officer
##                         Agency_Name
## 441               police department
## 1261              police department
## 908    civilian complaint review bd
## 1232                 law department
## 1256              police department
## 1200     dept. of homeless services
## 1175 dept of environment protection
## 30                police department
## 340    dept of info tech & telecomm
## 939        department of correction

Discussion

The regression results demonstrate that the Ridge regression model provides a strong and reliable prediction of employee annual income. Compared to the baseline model, the Ridge model achieves a substantially higher R² value and significantly lower prediction errors, indicating that the selected predictors capture most of the meaningful variation in employee compensation.

The dominance of job title variables among the most influential predictors highlights the structured nature of salary determination within the New York City government. Since compensation is largely determined by job classification, role responsibility, and union or pay grade systems, it is reasonable that job title plays a central role in predicting income. Base salary, agency affiliation, and pay basis further refine these predictions by capturing institutional and contractual differences across departments.

The use of a log-transformed target variable improves model stability by reducing the influence of extreme income values. This transformation allows the model to better capture relative income differences across employees, particularly in a dataset with a right-skewed income distribution.

Error analysis reveals that the largest prediction errors tend to occur in specific job roles or agencies with complex or irregular compensation structures. In such cases, actual income may be influenced by partial-year employment, special allowances, or atypical payment arrangements that are not fully represented by the available features. These findings suggest that while the model performs well overall, incorporating additional variables related to employment duration or special compensation types could further improve accuracy.

Overall, the results indicate that the Ridge regression model is well-suited for modeling employee income in this context and provides meaningful insights into the factors driving salary variation.

Conclusion

The regression analysis successfully models the annual income of New York City employees using payroll and job-related data. By comparing a baseline approach with a Ridge regression model, the analysis demonstrates that regularized regression significantly improves predictive performance and explanatory power.

The classification analysis also creates a reliable and robust model, with high accuracies on predicting employee leave status, which helps in risk management for human resources.

With the success of the analyses, we hope that the project will set a precedence for future work to look for feasibilities in extending the analysis onto local markets to evaluate the generalizability. ***

Report: WQD7004 Project - Group 11

Chong Zhan Hang, Shi Yaqi, Long Yun, Liang Sheng Bao, Huang Xin Yi

2026-01-15

WQD7004 Group Project - Group 11

Introduction

Project Objectives

Data Cleaning

1. Data Loading and Fiscal Year Filtering

2. Unique Employee ID Creation and Sampling

3. Initial Data Overview

4. Type Conversion and Standardization

5. Missing Value Handling

6. Outlier Treatment

7. Duplicate Removal and Factor Conversion

8. Saving Cleaned Data

9. Conclusion and Summary

Exploratory Data Analysis and Data Visualization

1. Executive Summary

2. Data Loading and Initial Setup

2.1 Loading Required Libraries

2.2 Loading the Data

2.3 Data Overview Statistics

3. Base Salary Distribution Analysis

3.1 Histogram of Base Salaries

Code Implementation:

Visualization:

3.2 Boxplot by Work Location

Code Implementation:

Visualization:

4. Agency Analysis

4.1 Top 10 Agencies by Average Salary

Code Implementation:

Visualization:

4.2 Agency Size vs Salary Relationship

Code Implementation:

Visualization:

5. Salary Composition by Leave Status

Code Implementation:

Visualization:

6. Time Trend Analysis

Code Implementation:

Visualization:

7. Comprehensive Dashboard

Code Implementation:

Visualization:

8. Key Findings and Recommendations

8.1 Major Findings

8.2 Recommendations

9. Appendix

9.1 Complete Summary Statistics

9.2 R Session Information

9.3 Data Quality Notes

How to Use This Report

For Data Analysts:

For Decision Makers:

For Technical Reviewers:

Analysis: Classification

Loading Data

Additional Data Processing for Classification Analysis

Train-test Split

Model Training and Evaluation

Analysis: Regression

Objective of Regression Analysis

Target Variable Construction

Feature Selection

Train–Test Split and Baseline Model

Ridge Regression Model

Model Interpretation: Key Predictors

Error Analysis

Discussion

Conclusion