Introduction

This report uses Stack Overflow 2025 developer survey data to study developers’ job satisfaction. We apply the multi-classification method to predict the job satisfaction of developers into three orderly categories based on demographic characteristics, occupational characteristics and technology-related characteristics. Understanding these drivers is of great value to developers, employers, and human resources researchers working in rapidly developing technological environments.

Framing & Dataset Selection

Research Question

Can we accurately predict the developer’s job satisfaction (dissatisfied, neutral or satisfied) based on his professional background, working conditions and technical usage patterns?

Motivation

The job satisfaction of software developers affects employee retention, mental health and productivity. We adopt the three-level classification method to achieve a balance between distinguishing different categories and ensuring sufficient sample size, thus achieving stable modelling. Employers can use these insights to develop interventions and identify key working conditions, while developers can understand the relationship between career choice and satisfaction. In view of the impact of artificial intelligence-assisted development, this analysis is particularly timely.

Consider & Gather Data

Data Source

The data set used is the Stack Overflow Annual Developer Survey 2025, which can be publicly accessed: https://survey.stackoverflow.co/. The survey collected feedback from global developers, including demographic information, educational background, employment, programming language, tools and attitudes towards artificial intelligence. The target variable of this project is JobSat (overall job satisfaction), which was initially recorded as a value of 0 to 10, and we re-encoded it into a three-level ordered factor: Dissatisfied (0–6), Neutral (7–8) and Satisfied (9–10).

Dataset Justification

The dataset satisfies four of the five required criteria:

1. Large Size (>10,000 observations)

library(tidyverse)

df <- read_csv("survey_results_public.csv", show_col_types = FALSE)

cat(
  "Number of rows:", nrow(df), "\n",
  "Number of columns:", ncol(df), "\n"
)

## Number of rows: 49191 
##  Number of columns: 172

The data set contains 49,191 rows, far exceeding the threshold of 10,000 rows. In view of this scale, we will use hierarchical sampling to control the calculation cost during model training, and use cross-verification instead of a single training set-test set division to evaluate the model.

2. Messy (Missing Values)

cols_of_interest <- c(
  "JobSat", "Age", "EdLevel", "Employment", "RemoteWork",
  "WorkExp", "OrgSize", "ICorPM", "Country", "CompTotal",
  "AISelect", "AISent", "AIThreat", "DevType", "YearsCode"
)

miss_df <- df %>%
  select(all_of(cols_of_interest)) %>%
  summarise(across(everything(), ~mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Rate") %>%
  arrange(desc(Missing_Rate))

ggplot(miss_df, aes(x = reorder(Variable, Missing_Rate), y = Missing_Rate)) +
  geom_col(fill = "#e05c5c") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Missing Value Rate for Key Variables",
    x = NULL,
    y = "Proportion Missing"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.y = element_text(size = 6),      
    axis.text.x = element_text(size = 10),      
    plot.title = element_text(face = "bold", size = 12)
  )

The target variable JobSat is missing in 45.78% of the respondents. Some predictors also have a large number of missing values (for example, the missing value of CompTotal is about 49%, and the missing value of OrgSize is about 31%). Therefore, it is necessary to adopt a cautious interpolation strategy and conduct sensitivity analysis.

3. Complex (>100 Features, Mixed Variable Types)

num_features <- sum(sapply(df, is.numeric))
cat_features <- sum(sapply(df, is.character))

cat(
  "Total features:", ncol(df), "\n",
  "Numerical features:", num_features, "\n",
  "Categorical features:", cat_features, "\n"
)

## Total features: 172 
##  Numerical features: 53 
##  Categorical features: 119

The data set contains 172 features - 53 numerical features and 119 subtype features - which brings significant dimensional challenges. Many classification columns are coded with multiple-choice responses (separated by semicolons), which need to be parsed and binaryed before modelling. Feature selection or dimension reduction (for example, principal component analysis (PCA) of encoded features, or using LASSO punishment regression) is crucial.

4. Multi-class Target (Class Distribution)

jobsat_dist <- df %>%
  filter(!is.na(JobSat)) %>%
  mutate(JobSat_Class = case_when(
    JobSat <= 6  ~ "Dissatisfied (0-6)",
    JobSat <= 8  ~ "Neutral (7-8)",
    JobSat <= 10 ~ "Satisfied (9-10)"
  )) %>%
  mutate(JobSat_Class = factor(JobSat_Class,
    levels = c("Dissatisfied (0-6)", "Neutral (7-8)", "Satisfied (9-10)"))) %>%
  count(JobSat_Class) %>%
  mutate(Proportion = n / sum(n))

ggplot(jobsat_dist, aes(x = JobSat_Class, y = Proportion, fill = JobSat_Class)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(n, "\n(", scales::percent(Proportion, accuracy = 0.1), ")")),
            vjust = -0.3, size = 3.2) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.55)) +
  scale_fill_manual(values = c("#d73027", "#fee090", "#1a9850")) +
  labs(
    title = "Distribution of Job Satisfaction (3-Class Recoding)",
    x = "Job Satisfaction Class",
    y = "Proportion"
  ) +
  theme_minimal(base_size = 11)

The original 0-10 points numerical JobSat score was recoded into three ordered classes: Dissatisfied (0-6), Neutral (7-8) and Satisfied (9-10). The number of samples in these three categories is about 7,386, 12,562 and 6,522, respectively, showing a moderate imbalance, with the neutral category dominating. We will solve this problem by using hierarchical sampling and macro average F1 scores as the main evaluation indicators.

Exploratory Data Analysis (EDA)

[To be completed by the group — include visualisations and summaries of key predictors, target variable, and relationships of interest.]

Project Planning

[To be completed by the group — outline modelling strategy, train/test split plan, algorithms to be trialled, and evaluation framework.]

References

Stack Overflow. (2025). Stack Overflow Annual Developer Survey 2025. https://survey.stackoverflow.co/

STAT5003 Group Project: Predicting Developer Job Satisfaction

Your Group Number