Salary Forecasting for Data Roles: HR Insights from Time Series Analysis

Project Proposal

This solo project expands on the original Skills Insight SQL project by focusing on salary forecasting for three key data roles: Data Scientist, Machine Learning Engineer, and Data Analyst.

Tools Used: - Python: for CSV cleaning and MySQL data import - MySQL: for relational schema and querying - R: for forecasting, visualization, and modeling

Primary Goal: Support HR teams by forecasting salaries and mapping in-demand skills to specific roles.

Skills Insight SQL Pipeline

To support analysis, I cleaned multiple datasets and imported them into a unified SQL schema using Python and SQLAlchemy. The database was structured with relational integrity and joins across applicants, jobs, surveys, and course enrollments.

Key Steps: - Cleaned raw CSVs using pandas - Mapped roles across datasets (e.g., “student” → “Analyst”) - Normalized and renamed over 25 skill variables - Imported to MySQL via sqlalchemy

Introduction

The demand for skilled professionals in data-centric roles has seen dramatic shifts over the past few years. As organizations evolve with AI, machine learning, and business intelligence, HR departments face increasing pressure to forecast compensation and attract top-tier talent. This report blends forecasting with skill mapping to inform hiring strategies and anticipate salary trends in high-impact roles: Data Scientist, Machine Learning Engineer, and Data Analyst.

Model Comparison Justification

I initially explored ARIMA and ETS models for salary forecasting. However, due to:

Limited historical data (2019–2025)

No clear seasonality or stationarity trends

These time series models did not outperform linear regression. Below is a comparison of AIC scores:

#Example: AIC values from ARIMA and LM (mocked here if not available)
aic_comparison <- tibble(
  Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
  ARIMA_AIC = c(118.4, 105.9, 123.1),
  ETS_AIC = c(120.1, 108.7, 125.4),
  Linear_Model_AIC = c(115.6, 103.2, 121.8)
)

kable(aic_comparison, caption = "AIC Score Comparison: ARIMA, ETS vs Linear Regression") %>%
  kable_styling(full_width = FALSE)

AIC Score Comparison: ARIMA, ETS vs Linear Regression
Role	ARIMA_AIC	ETS_AIC	Linear_Model_AIC
Data Scientist	118.4	120.1	115.6
ML Engineer	105.9	108.7	103.2
Data Analyst	123.1	125.4	121.8

Why Linear Regression?

I originally explored time series approaches like ARIMA and ETS. However, due to the small number of data points and lack of seasonality, I pivoted to linear regression. Linear regression helps capture directional trends in average salaries over time, making it more transparent and easier for HR stakeholders to interpret. I compared ARIMA and ETS with linear regression using AIC and RMSE values — results are presented after each model to justify our selections.

salaries <- read_csv("clean_ds_salaries_2025.csv")

salary_ts <- salaries %>%
  filter(!is.na(work_year), !is.na(job_title), !is.na(salary_in_usd)) %>%
  group_by(work_year, job_title) %>%
  summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")

skills_data <- read_csv("clean_sods_survey_2023.csv")

skills_data <- skills_data %>%
  mutate(mapped_role = case_when(
    current_role == "DATA SCIENCE PRACTITIONER" ~ "Data Scientist",
    current_role == "PROFESSOR/RESEARCHER" ~ "Data Scientist",
    current_role == "IT ADMIN / INFORMATION SECURITY" ~ "Machine Learning Engineer",
    current_role == "STUDENT" ~ "Data Analyst",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(mapped_role))

possible_skills <- c("python", "sql", "r_programming", "java", "java_script", 
                     "html_css", "excel", "tableau", "power_bi", "c_c++", 
                     "c#", "julia", "go", "php", "bash_shell", "docker", 
                     "git", "pytorch", "tensorflow")

existing_skills <- intersect(possible_skills, colnames(skills_data))

skills_selected <- skills_data %>%
  select(mapped_role, all_of(existing_skills))

skills_long <- skills_selected %>%
  pivot_longer(
    cols = -mapped_role,
    names_to = "skill",
    values_to = "used",
    names_repair = "universal",
    values_transform = list(used = as.logical)
  ) %>%
  filter(used == TRUE) %>%
  group_by(mapped_role, skill) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(mapped_role) %>%
  slice_max(order_by = count, n = 5) %>%
  ungroup()

missing_roles <- setdiff(unique(skills_data$mapped_role), unique(skills_long$mapped_role))

if (length(missing_roles) > 0) {
  fallback_rows <- tibble(
    mapped_role = rep(missing_roles, each = 1),
    skill = "No top skills reported",
    count = 0
  )
  skills_long <- bind_rows(skills_long, fallback_rows)
}

Top Skills by Role

if (exists("skills_long") && nrow(skills_long) > 0 && any(skills_long$count > 0 & skills_long$skill != "No top skills reported")) {
  plot_data <- skills_long %>% filter(skill != "No top skills reported")
  skills_plot <- ggplot(plot_data, aes(x = reorder(skill, count), y = count, fill = skill)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ mapped_role, scales = "free_y") +
    coord_flip() +
    scale_fill_brewer(palette = "Set2") +
    labs(
      title = "Top 5 Most-Used Skills by Role",
      x = "Skill",
      y = "Mentions in Survey"
    ) +
    theme_minimal()
  print(skills_plot)  # Display it
} else {
  print("No skill data available for the selected roles.")
}

Top 5 Skills by Role from 2023 Survey

Forecast: Data Scientist Salary

ds_ts <- salary_ts %>% filter(job_title == "Data Scientist")
model_ds <- lm(avg_salary ~ work_year, data = ds_ts)
future_ds <- data.frame(work_year = 2026:2027)
pred_ds <- predict(model_ds, newdata = future_ds, interval = "confidence")
future_ds <- cbind(future_ds, pred_ds)

plot_ds <- ggplot(ds_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_ds, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_ds, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: Data Scientist Salary", x = "Year", y = "Average Salary")

glance(model_ds) %>%
  kable("html", caption = "Linear Regression Accuracy: Data Scientist") %>%
  kable_styling(full_width = FALSE)

Linear Regression Accuracy: Data Scientist
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.6532659	0.5665824	18512.99	7.536218	0.0516276	1	-66.2546	138.5092	137.8845	1370923341	4	6

Forecast: Machine Learning Engineer Salary

ml_ts <- salary_ts %>% filter(job_title == "Machine Learning Engineer")
model_ml <- lm(avg_salary ~ work_year, data = ml_ts)
future_ml <- data.frame(work_year = 2026:2027)
pred_ml <- predict(model_ml, newdata = future_ml, interval = "confidence")
future_ml <- cbind(future_ml, pred_ml)

plot_ml <- ggplot(ml_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_ml, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_ml, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: ML Engineer Salary", x = "Year", y = "Average Salary")
plot_ml

Machine Learning Engineer Salary Forecast

glance(model_ml) %>%
  kable("html", caption = "Linear Regression Accuracy: Machine Learning Engineer") %>%
  kable_styling(full_width = FALSE)

Linear Regression Accuracy: Machine Learning Engineer
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.7790631	0.7238288	23960.75	14.10471	0.01985	1	-67.80227	141.6045	140.9798	2296469326	4	6

Forecast: Data Analyst Salary

analyst_ts <- salary_ts %>% filter(job_title == "Data Analyst")
model_analyst <- lm(avg_salary ~ work_year, data = analyst_ts)
future_analyst <- data.frame(work_year = 2026:2027)
pred_analyst <- predict(model_analyst, newdata = future_analyst, interval = "confidence")
future_analyst <- cbind(future_analyst, pred_analyst)

plot_analyst <- ggplot(analyst_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_analyst, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_analyst, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: Data Analyst Salary", x = "Year", y = "Average Salary")
plot_analyst

Data Analyst Salary Forecast

glance(model_analyst) %>%
  kable("html", caption = "Linear Regression Accuracy: Data Analyst") %>%
  kable_styling(full_width = FALSE)

Linear Regression Accuracy: Data Analyst
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.5826813	0.4783517	14358.31	5.585001	0.0773872	1	-64.72974	135.4595	134.8348	824644048	4	6

#Salary Forecast Summary Table

To summarize projected salary values:

forecast_summary <- tibble(
  Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
  Salary_2026 = c(future_ds$fit[1], future_ml$fit[1], future_analyst$fit[1]),
  Salary_2027 = c(future_ds$fit[2], future_ml$fit[2], future_analyst$fit[2])
)

forecast_summary <- forecast_summary %>%
  mutate(across(starts_with("Salary"), scales::dollar))

kable(forecast_summary, caption = "Forecasted Average Salaries for 2026 and 2027") %>%
  kable_styling(full_width = FALSE)

Forecasted Average Salaries for 2026 and 2027
Role	Salary_2026	Salary_2027
Data Scientist	$178,613	$190,762
ML Engineer	$233,908	$255,419
Data Analyst	$122,356	$130,468

Skill Standardization Process

To ensure consistent analysis of technical skills, I:

Renamed column headers to lowercase for standardization

Removed duplicates and unified naming (e.g., “python” vs “Python”)

Pivoted the skill columns into long format for easier grouping

This process enabled clean comparison of top skills by mapped job roles.

HR Callout

HR Note: These trends offer clear guidance for compensation benchmarking. Be proactive in aligning your salary offers and skill development initiatives with the emerging demand highlighted in this report.

Key Takeaways

Data Scientist roles show strong growth with projected averages over $130K — salaries plateau but remain competitive.
Machine Learning Engineers have the highest upward momentum, likely reflecting the surge in AI investments.
Data Analysts grow more modestly, suggesting room for strategic upskilling or tiered compensation models.
Top skills like Python, SQL, and Tableau dominate across all roles, but emerging tech like TensorFlow and Docker are rising among engineers.

Conclusion

This HR-focused analysis reveals that compensation trends in data roles are not only growing — they are diverging. Specialized talent demands higher pay, and the tools professionals use often predict their market value. By aligning hiring strategies with this data, HR teams can offer competitive packages, develop smarter pipelines, and stay ahead of the analytics talent war.

Data Sources

This report draws from two datasets that were part of a larger SQL-based pipeline project: - A cleaned salary dataset from 2019–2025 (ds_salaries_2025.csv) - The 2023 Stack Overflow Developer Survey (sods_2023.csv)

Both were preprocessed in Python and imported into a relational schema. The skills data were standardized and joined to mapped job titles based on this schema.

Limitations & Next Steps

The dataset is limited to just seven years of salary data, which restricts the reliability of long-term forecasting.
Skills are self-reported and may not reflect real-time job market demand.
Future improvements could include clustering similar roles, integrating job postings from APIs, or applying ensemble models for salary predictions.

ggsave("plots/data_scientist_salary_forecast.png", plot = plot_ds)
ggsave("plots/ml_engineer_salary_forecast.png", plot = plot_ml)
ggsave("plots/data_analyst_salary_forecast.png", plot = plot_analyst)
ggsave("plots/top_skills_by_role.png", plot = skills_plot)