Introduction

The New York City Jobs dataset provides comprehensive information on public sector job postings in NYC, including job titles, agencies, number of positions, salary ranges, and other relevant attributes. Understanding patterns in this dataset is valuable for identifying trends in job distribution, salary levels, and workforce demand across different departments.

This report explores these patterns with unsupervised machine learning. After cleaning the data, we apply Principal Component Analysis (PCA) for dimensionality reduction and K-means clustering to group postings by number of positions and salary range.


Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)


# Load dataset from NYC open data portal
nyc_jobs <- read_csv("https://data.cityofnewyork.us/api/views/kpav-sd4t/rows.csv?accessType=DOWNLOAD")
## Rows: 2796 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): Agency, Posting Type, Business Title, Civil Service Title, Title C...
## dbl  (4): Job ID, # Of Positions, Salary Range From, Salary Range To
## lgl  (1): Recruitment Contact
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean column names
nyc_jobs <- janitor::clean_names(nyc_jobs)

# Quick glimpse
glimpse(nyc_jobs)
## Rows: 2,796
## Columns: 30
## $ job_id                        <dbl> 748087, 752625, 758321, 761748, 747853, …
## $ agency                        <chr> "CULTURAL AFFAIRS", "CONSUMER AND WORKER…
## $ posting_type                  <chr> "Internal", "Internal", "Internal", "Ext…
## $ number_of_positions           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3…
## $ business_title                <chr> "Associate Arts Program Specialist", "Pr…
## $ civil_service_title           <chr> "ASSOCIATE ARTS PROGRAMS SPECIA", "PROCU…
## $ title_classification          <chr> "Non-Competitive-5", "Competitive-1", "C…
## $ title_code_no                 <chr> "60496", "12158", "1003D", "56057", "102…
## $ level                         <chr> "00", "02", "00", "00", "01", "00", "01"…
## $ job_category                  <chr> "Constituent Services & Community Progra…
## $ full_time_part_time_indicator <chr> "F", "F", "F", "F", "P", "F", "F", "F", …
## $ career_level                  <chr> "Experienced (non-manager)", "Experience…
## $ salary_range_from             <dbl> 59891, 57370, 68214, 44545, 17, 78028, 7…
## $ salary_range_to               <dbl> 68875.0, 65976.0, 73561.0, 51227.0, 17.5…
## $ salary_frequency              <chr> "Annual", "Annual", "Annual", "Annual", …
## $ work_location                 <chr> "31 Chambers St., N.Y.", "42 Broadway, N…
## $ division_work_unit            <chr> "Programs", "Admin-Procurement", "Admin-…
## $ job_description               <chr> "The Department of Cultural Affairs (DCL…
## $ minimum_qual_requirements     <chr> "1. Five years of full-time experience i…
## $ preferred_skills              <chr> NA, "The ideal candidate will have demon…
## $ additional_information        <chr> NA, "Additional Information  In addition…
## $ to_apply                      <chr> NA, "To apply for this position, please …
## $ hours_shift                   <chr> NA, NA, NA, NA, NA, NA, "0900 - 1700 hou…
## $ work_location_1               <chr> "31 Chambers St., N.Y.", "42 Broadway, N…
## $ recruitment_contact           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ residency_requirement         <chr> "New York City residency is generally re…
## $ posting_date                  <chr> "10/27/2025", "11/05/2025", "11/22/2025"…
## $ post_until                    <chr> NA, "03-FEB-2026", "20-FEB-2026", "11-FE…
## $ posting_updated               <chr> "01/20/2026", "11/05/2025", "11/22/2025"…
## $ process_date                  <chr> "01/26/2026", "01/26/2026", "01/26/2026"…

Observation: The dataset contains 30 columns, most of them text. For PCA and clustering, we focus on the numeric columns that describe the number of positions and the salary range.

Select and Clean Numeric Columns

We select only the numeric columns relevant for analysis:

number_of_positions

salary_range_from

salary_range_to

The code below also strips $ and commas and coerces the columns to numeric. This is a safeguard only, since read_csv() already parsed all three columns as doubles. Rows with missing values are then removed.
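The cleaning rule described above can be tried on a few made-up salary strings. This is a minimal sketch: the vector raw_salary and the helper name to_numeric_salary are hypothetical and are not part of the dataset or of the report's code.

```r
# Hypothetical helper: strip "$" and "," and coerce to numeric.
to_numeric_salary <- function(x) {
  as.numeric(gsub("\\$|,", "", x))
}

# Made-up example values, including a missing entry
raw_salary <- c("$59,891", "68875.0", NA)
clean_salary <- to_numeric_salary(raw_salary)
print(clean_salary)

# Rows with a missing value would then be dropped, as na.omit() does
kept <- clean_salary[!is.na(clean_salary)]
print(kept)
```

Values that are already numeric pass through this conversion unchanged, which is why the step does no harm on columns that read_csv() has already parsed.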


jobs_numeric <- nyc_jobs %>%
  select(number_of_positions, salary_range_from, salary_range_to) %>%
  mutate(across(everything(), ~ as.numeric(gsub("\\$|,", "", .)))) %>%
  na.omit()

# Quick summary
summary(jobs_numeric)
##  number_of_positions salary_range_from salary_range_to 
##  Min.   :  1.000     Min.   :     0    Min.   :    17  
##  1st Qu.:  1.000     1st Qu.: 58164    1st Qu.: 68709  
##  Median :  1.000     Median : 68213    Median : 90551  
##  Mean   :  2.445     Mean   : 70359    Mean   : 95664  
##  3rd Qu.:  1.000     3rd Qu.: 86749    3rd Qu.:118623  
##  Max.   :600.000     Max.   :250000    Max.   :280567
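
The summary shows a minimum salary_range_to of 17, and the glimpse() output includes a part-time posting with a range of 17 to 17.5. This suggests that some rows hold hourly or daily rates rather than annual salaries, which would distort both the PCA and the clusters. One possible remedy is to annualize pay using the salary_frequency column. The sketch below is not part of the analysis reported here. The labels "Hourly" and "Daily", the conversion factors, the helper annualize, and the data frame toy are all assumptions made for illustration.

```r
# Assumed conversion factors (not from the source):
# 2,080 working hours and 260 working days per year.
annualize <- function(amount, frequency) {
  multiplier <- ifelse(frequency == "Hourly", 2080,
                ifelse(frequency == "Daily", 260, 1))
  amount * multiplier
}

# Made-up postings for illustration
toy <- data.frame(
  salary_range_from = c(59891, 17, 300),
  salary_frequency  = c("Annual", "Hourly", "Daily")
)

toy$salary_from_annual <- annualize(toy$salary_range_from,
                                    toy$salary_frequency)
print(toy)
```

An alternative with fewer assumptions would be to restrict the analysis to postings whose frequency is annual.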
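# Standardize each variable to mean 0 and standard deviation 1 so that
# salary magnitudes do not dominate the position counts
jobs_scaled <- scale(jobs_numeric)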


pca_res <- PCA(jobs_scaled, graph = FALSE)
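
PCA() from FactoMineR standardizes its input by default (scale.unit = TRUE), so calling scale() beforehand is harmless but not strictly required. The sketch below illustrates the same idea with base R's prcomp() on simulated data (the data frame toy is made up): standardizing first, or letting the PCA function do it, gives the same component variances.

```r
set.seed(1)
# Made-up data with three variables on very different scales
toy <- data.frame(
  positions   = rpois(200, 2) + 1,
  salary_from = rnorm(200, 70000, 20000),
  salary_to   = rnorm(200, 95000, 30000)
)

pca_prescaled <- prcomp(scale(toy))          # standardize first, then PCA
pca_internal  <- prcomp(toy, scale. = TRUE)  # let prcomp() standardize

# The component standard deviations agree
same_sdev <- all.equal(pca_prescaled$sdev, pca_internal$sdev)
print(same_sdev)
```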

# Scree plot: explained variance
fviz_eig(pca_res, addlabels = TRUE, barfill = "steelblue", barcolor = "black") +
  ggtitle("Variance Explained by Principal Components")
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

# PCA individuals plot
fviz_pca_ind(pca_res, geom = "point", repel = TRUE) +
  ggtitle("PCA: Individuals (Jobs) on First Two Components")

# Elbow method
fviz_nbclust(as.data.frame(pca_res$ind$coord[,1:2]), kmeans, method = "wss") +
  ggtitle("Elbow Method for Optimal K")

# Silhouette method
fviz_nbclust(as.data.frame(pca_res$ind$coord[,1:2]), kmeans, method = "silhouette") +
  ggtitle("Silhouette Method for Optimal K")

set.seed(123)
pca_scores <- as.data.frame(pca_res$ind$coord[,1:2])  # Take first 2 PCs
k <- 3
kmeans_res <- kmeans(pca_scores, centers = k, nstart = 25)
pca_scores$cluster <- factor(kmeans_res$cluster)

# Visualize clusters
fviz_cluster(kmeans_res,
             data = pca_scores[,1:2],
             ellipse.type = "convex",
             geom = "point",
             repel = TRUE) +
  ggtitle("K-Means Clustering on PCA-Reduced NYC Jobs Data")

jobs_clustered <- cbind(jobs_numeric, cluster = pca_scores$cluster)

# Summary by cluster
jobs_clustered %>%
  group_by(cluster) %>%
  summarise(
    count = n(),
    avg_positions = mean(number_of_positions, na.rm = TRUE),
    avg_salary_from = mean(salary_range_from, na.rm = TRUE),
    avg_salary_to = mean(salary_range_to, na.rm = TRUE)
  )
## # A tibble: 3 × 5
##   cluster count avg_positions avg_salary_from avg_salary_to
##   <fct>   <int>         <dbl>           <dbl>         <dbl>
## 1 1        1545          2.44          51285.        63919.
## 2 2        1249          1.49          93980.       135000.
## 3 3           2        600             54032         54032

Visualization Explanation

  1. Scree Plot (Variance Explained by Principal Components)

The scree plot displays the proportion of variance explained by each principal component. The height of each bar represents how much of the dataset’s variability is captured by that component.

Interpretation:

Because only three variables enter the PCA, there are three principal components. The first two capture most of the variance, so the data can be represented in two dimensions with limited loss of information.

This justifies reducing the dataset to the first two principal components for visualization and clustering, as these components retain the most significant patterns in the data.
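
The quantities behind a scree plot can be computed directly. The sketch below uses base R's prcomp() on simulated data (the data frame toy and its values are made up for illustration) to show how eigenvalues translate into the proportion of variance per component. For the report's own PCA object, FactoMineR stores the corresponding table in pca_res$eig.

```r
set.seed(42)
n <- 300
salary_from <- rnorm(n, 70000, 20000)

# Made-up data: salary_to is correlated with salary_from
toy <- data.frame(
  positions   = rpois(n, 1) + 1,
  salary_from = salary_from,
  salary_to   = salary_from * 1.3 + rnorm(n, 0, 8000)
)

pca <- prcomp(toy, scale. = TRUE)

eigenvalues   <- pca$sdev^2                      # variance of each component
prop_variance <- eigenvalues / sum(eigenvalues)  # bar heights in a scree plot
cum_variance  <- cumsum(prop_variance)           # running total

print(round(data.frame(component = 1:3, prop_variance, cum_variance), 3))
```

With standardized inputs the eigenvalues sum to the number of variables, so each variable contributes one unit of variance to be divided among the components.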

  2. PCA Individuals Plot

This scatter plot visualizes the jobs projected onto the first two principal components. Each point represents a job, positioned based on its transformed values along the principal components.

Interpretation:

Jobs that are close together in this plot share similar characteristics in terms of number of positions and salary ranges.

Jobs that are far apart are more distinct in these features.

This plot provides a clear 2D view of the dataset’s structure after dimensionality reduction.
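
The "transformed values" are the standardized data multiplied by the component loadings. The sketch below, again using prcomp() on made-up data rather than the report's FactoMineR object, reproduces the scores by hand and checks them against the ones the function returns.

```r
set.seed(7)
# Made-up data for illustration
toy <- data.frame(
  positions   = rpois(100, 1) + 1,
  salary_from = rnorm(100, 70000, 20000),
  salary_to   = rnorm(100, 95000, 30000)
)

pca <- prcomp(toy, scale. = TRUE)

# Scores = standardized data multiplied by the loading (rotation) matrix
manual_scores <- scale(toy) %*% pca$rotation

# These match the scores returned by prcomp(); the first two columns are
# the coordinates drawn in an individuals plot
scores_match <- all.equal(unname(manual_scores), unname(pca$x))
print(scores_match)
print(head(manual_scores[, 1:2]))
```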

  3. Elbow Method Plot

The elbow plot helps determine the optimal number of clusters (k) for K-means. It shows the total within-cluster sum of squares (WSS) for different values of k.

Interpretation:

The “elbow” point, where the reduction in WSS starts to level off, indicates a suitable number of clusters.

In this analysis, the elbow suggested that k = 3 is appropriate for grouping jobs with similar patterns in positions and salaries.
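
The curve drawn by the elbow method can be reproduced by running K-means for several values of k and recording the total within-cluster sum of squares. The sketch below does this on simulated two-dimensional scores (the matrix scores is made up and stands in for the first two principal components).

```r
set.seed(123)
# Simulated 2-D scores with three loose groups (made-up data)
scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 5))
)

k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which the drop in WSS flattens out
print(data.frame(k = k_values, wss = round(wss, 1)))
```

Reading the elbow is a judgment call, which is why it is usually paired with a second criterion such as the silhouette width.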

  4. Silhouette Method Plot

The silhouette plot evaluates cluster quality for different k values. It measures how similar each job is to its own cluster compared to other clusters.

Interpretation:

Higher silhouette values indicate better-defined clusters.

The plot confirmed that k = 3 produces well-separated clusters, aligning with the elbow method.
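
The average silhouette width per k can also be computed directly with silhouette() from the cluster package. The sketch below uses simulated scores (made-up data, not the report's PCA coordinates) and picks the k with the highest average width.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Simulated 2-D scores with three loose groups (made-up data)
scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 5))
)
d <- dist(scores)

k_values <- 2:6  # silhouette width is undefined for a single cluster
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(scores, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```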

  5. K-Means Cluster Plot on PCA-Reduced Data

This scatter plot overlays K-means cluster assignments on the PCA-reduced space. Each point represents a job, colored by its assigned cluster, and convex hulls show the cluster boundaries.

Interpretation:

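Cluster 1 (1,545 postings): Jobs with lower salaries (average range of about $51,300 to $63,900) and an average of 2.44 positions per posting.

Cluster 2 (1,249 postings): Jobs with higher salaries (average range of about $94,000 to $135,000) and fewer openings, averaging 1.49 positions per posting.

Cluster 3 (2 postings): Outlier postings that each advertise 600 positions, with an average salary of $54,032.

The plot shows that K-means grouped jobs with similar numeric characteristics. Because Cluster 3 contains only two postings, the solution is effectively two salary tiers plus a small group of mass-hiring outliers.

PCA captured the main patterns in the data (salary level and number of positions), and K-means used these patterns to form the groups.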
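
A quick check of cluster sizes helps when interpreting a K-means solution, because a very small cluster usually reflects outliers rather than a broad group. The sketch below uses the counts from the summary table in this report. The 1% threshold is an arbitrary assumption chosen for illustration.

```r
# Cluster sizes copied from the summary-by-cluster table
cluster_sizes <- c("1" = 1545, "2" = 1249, "3" = 2)

share <- cluster_sizes / sum(cluster_sizes)
print(round(100 * share, 2))  # percentage of postings per cluster

# Flag clusters holding less than 1% of postings (assumed threshold)
small_clusters <- names(share)[share < 0.01]
print(small_clusters)
```

Possible follow-ups, not carried out here, include log-transforming number_of_positions before scaling or clustering again without the flagged postings.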

Conclusion

The analysis of the NYC Jobs dataset using PCA and K-means clustering yielded several key insights:

Dimensionality Reduction: PCA reduced the three numeric variables to two principal components while retaining most of the variance, which made the job patterns easier to visualize and interpret.

Cluster Identification: K-means clustering with k = 3 produced two large groups and one small outlier group, which differ in the number of positions and salary ranges:

Cluster 1: Jobs with lower salaries (about $51,300 to $63,900 on average) and an average of 2.44 positions per posting.

Cluster 2: Jobs with higher salaries (about $94,000 to $135,000 on average) and fewer openings, averaging 1.49 positions per posting.

Cluster 3: Two outlier postings that each advertise 600 positions, with an average salary of $54,032.

Insights for Workforce Planning: Understanding these clusters can assist agencies in decision-making related to recruitment priorities, salary adjustments, and resource allocation. For example, positions in high-salary clusters may require more targeted recruitment efforts due to limited openings.

Methodological Value: Combining PCA for dimensionality reduction with K-means for clustering gives a simple, repeatable workflow for exploratory analysis of public sector workforce datasets.

Overall, this analysis demonstrates how unsupervised learning techniques can uncover patterns in employment datasets, offering a foundation for more in-depth studies by job type, agency, or geographic region.

References

New York City Open Data. (n.d.). NYC Jobs Dataset. Retrieved from https://data.cityofnewyork.us/

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.

Kaufman, L., & Rousseeuw, P. J. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., … & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Husson, F., Josse, J., Le, S., & Mazet, J. (2019). FactoMineR: Multivariate Exploratory Data Analysis and Data Mining. R package version 2.4.