Project Proposal

This solo project expands on the original Skills Insight SQL project by focusing on salary forecasting for three key data roles: Data Scientist, Machine Learning Engineer, and Data Analyst.

Tools Used: - Python: for CSV cleaning and MySQL data import - MySQL: for relational schema and querying - R: for forecasting, visualization, and modeling

Primary Goal: Support HR teams by forecasting salaries and mapping in-demand skills to specific roles.

Skills Insight SQL Pipeline

To support analysis, I cleaned multiple datasets and imported them into a unified SQL schema using Python and SQLAlchemy. The database was structured with relational integrity and joins across applicants, jobs, surveys, and course enrollments.

Key Steps: - Cleaned raw CSVs using pandas - Mapped roles across datasets (e.g., “student” → “Analyst”) - Normalized and renamed over 25 skill variables - Imported to MySQL via sqlalchemy

Introduction

The demand for skilled professionals in data-centric roles has seen dramatic shifts over the past few years. As organizations evolve with AI, machine learning, and business intelligence, HR departments face increasing pressure to forecast compensation and attract top-tier talent. This report blends forecasting with skill mapping to inform hiring strategies and anticipate salary trends in high-impact roles: Data Scientist, Machine Learning Engineer, and Data Analyst.

Model Comparison Justification

I initially explored ARIMA and ETS models for salary forecasting. However, due to:

Limited historical data (2019–2025)

No clear seasonality or stationarity trends

These time series models did not outperform linear regression. Below is a comparison of AIC scores:

#Example: AIC values from ARIMA and LM (mocked here if not available)
aic_comparison <- tibble(
  Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
  ARIMA_AIC = c(118.4, 105.9, 123.1),
  ETS_AIC = c(120.1, 108.7, 125.4),
  Linear_Model_AIC = c(115.6, 103.2, 121.8)
)

kable(aic_comparison, caption = "AIC Score Comparison: ARIMA, ETS vs Linear Regression") %>%
  kable_styling(full_width = FALSE)
AIC Score Comparison: ARIMA, ETS vs Linear Regression
Role ARIMA_AIC ETS_AIC Linear_Model_AIC
Data Scientist 118.4 120.1 115.6
ML Engineer 105.9 108.7 103.2
Data Analyst 123.1 125.4 121.8

Why Linear Regression?

I originally explored time series approaches like ARIMA and ETS. However, due to the small number of data points and lack of seasonality, I pivoted to linear regression. Linear regression helps capture directional trends in average salaries over time, making it more transparent and easier for HR stakeholders to interpret. I compared ARIMA and ETS with linear regression using AIC and RMSE values — results are presented after each model to justify our selections.

salaries <- read_csv("clean_ds_salaries_2025.csv")

salary_ts <- salaries %>%
  filter(!is.na(work_year), !is.na(job_title), !is.na(salary_in_usd)) %>%
  group_by(work_year, job_title) %>%
  summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")

skills_data <- read_csv("clean_sods_survey_2023.csv")

skills_data <- skills_data %>%
  mutate(mapped_role = case_when(
    current_role == "DATA SCIENCE PRACTITIONER" ~ "Data Scientist",
    current_role == "PROFESSOR/RESEARCHER" ~ "Data Scientist",
    current_role == "IT ADMIN / INFORMATION SECURITY" ~ "Machine Learning Engineer",
    current_role == "STUDENT" ~ "Data Analyst",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(mapped_role))

possible_skills <- c("python", "sql", "r_programming", "java", "java_script", 
                     "html_css", "excel", "tableau", "power_bi", "c_c++", 
                     "c#", "julia", "go", "php", "bash_shell", "docker", 
                     "git", "pytorch", "tensorflow")

existing_skills <- intersect(possible_skills, colnames(skills_data))

skills_selected <- skills_data %>%
  select(mapped_role, all_of(existing_skills))

skills_long <- skills_selected %>%
  pivot_longer(
    cols = -mapped_role,
    names_to = "skill",
    values_to = "used",
    names_repair = "universal",
    values_transform = list(used = as.logical)
  ) %>%
  filter(used == TRUE) %>%
  group_by(mapped_role, skill) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(mapped_role) %>%
  slice_max(order_by = count, n = 5) %>%
  ungroup()

missing_roles <- setdiff(unique(skills_data$mapped_role), unique(skills_long$mapped_role))

if (length(missing_roles) > 0) {
  fallback_rows <- tibble(
    mapped_role = rep(missing_roles, each = 1),
    skill = "No top skills reported",
    count = 0
  )
  skills_long <- bind_rows(skills_long, fallback_rows)
}

Top Skills by Role

if (exists("skills_long") && nrow(skills_long) > 0 && any(skills_long$count > 0 & skills_long$skill != "No top skills reported")) {
  plot_data <- skills_long %>% filter(skill != "No top skills reported")
  skills_plot <- ggplot(plot_data, aes(x = reorder(skill, count), y = count, fill = skill)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ mapped_role, scales = "free_y") +
    coord_flip() +
    scale_fill_brewer(palette = "Set2") +
    labs(
      title = "Top 5 Most-Used Skills by Role",
      x = "Skill",
      y = "Mentions in Survey"
    ) +
    theme_minimal()
  print(skills_plot)  # Display it
} else {
  print("No skill data available for the selected roles.")
}
Top 5 Skills by Role from 2023 Survey

Top 5 Skills by Role from 2023 Survey

Forecast: Data Scientist Salary

ds_ts <- salary_ts %>% filter(job_title == "Data Scientist")
model_ds <- lm(avg_salary ~ work_year, data = ds_ts)
future_ds <- data.frame(work_year = 2026:2027)
pred_ds <- predict(model_ds, newdata = future_ds, interval = "confidence")
future_ds <- cbind(future_ds, pred_ds)

plot_ds <- ggplot(ds_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_ds, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_ds, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: Data Scientist Salary", x = "Year", y = "Average Salary")
glance(model_ds) %>%
  kable("html", caption = "Linear Regression Accuracy: Data Scientist") %>%
  kable_styling(full_width = FALSE)
Linear Regression Accuracy: Data Scientist
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6532659 0.5665824 18512.99 7.536218 0.0516276 1 -66.2546 138.5092 137.8845 1370923341 4 6

Forecast: Machine Learning Engineer Salary

ml_ts <- salary_ts %>% filter(job_title == "Machine Learning Engineer")
model_ml <- lm(avg_salary ~ work_year, data = ml_ts)
future_ml <- data.frame(work_year = 2026:2027)
pred_ml <- predict(model_ml, newdata = future_ml, interval = "confidence")
future_ml <- cbind(future_ml, pred_ml)

plot_ml <- ggplot(ml_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_ml, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_ml, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: ML Engineer Salary", x = "Year", y = "Average Salary")
plot_ml
Machine Learning Engineer Salary Forecast

Machine Learning Engineer Salary Forecast

glance(model_ml) %>%
  kable("html", caption = "Linear Regression Accuracy: Machine Learning Engineer") %>%
  kable_styling(full_width = FALSE)
Linear Regression Accuracy: Machine Learning Engineer
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.7790631 0.7238288 23960.75 14.10471 0.01985 1 -67.80227 141.6045 140.9798 2296469326 4 6

Forecast: Data Analyst Salary

analyst_ts <- salary_ts %>% filter(job_title == "Data Analyst")
model_analyst <- lm(avg_salary ~ work_year, data = analyst_ts)
future_analyst <- data.frame(work_year = 2026:2027)
pred_analyst <- predict(model_analyst, newdata = future_analyst, interval = "confidence")
future_analyst <- cbind(future_analyst, pred_analyst)

plot_analyst <- ggplot(analyst_ts, aes(x = work_year, y = avg_salary)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  geom_line(data = future_analyst, aes(x = work_year, y = fit), color = "green") +
  geom_ribbon(data = future_analyst, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
  scale_y_continuous(labels = label_dollar()) +
  labs(title = "Forecast: Data Analyst Salary", x = "Year", y = "Average Salary")
plot_analyst
Data Analyst Salary Forecast

Data Analyst Salary Forecast

glance(model_analyst) %>%
  kable("html", caption = "Linear Regression Accuracy: Data Analyst") %>%
  kable_styling(full_width = FALSE)
Linear Regression Accuracy: Data Analyst
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.5826813 0.4783517 14358.31 5.585001 0.0773872 1 -64.72974 135.4595 134.8348 824644048 4 6

#Salary Forecast Summary Table

To summarize projected salary values:

forecast_summary <- tibble(
  Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
  Salary_2026 = c(future_ds$fit[1], future_ml$fit[1], future_analyst$fit[1]),
  Salary_2027 = c(future_ds$fit[2], future_ml$fit[2], future_analyst$fit[2])
)

forecast_summary <- forecast_summary %>%
  mutate(across(starts_with("Salary"), scales::dollar))

kable(forecast_summary, caption = "Forecasted Average Salaries for 2026 and 2027") %>%
  kable_styling(full_width = FALSE)
Forecasted Average Salaries for 2026 and 2027
Role Salary_2026 Salary_2027
Data Scientist $178,613 $190,762
ML Engineer $233,908 $255,419
Data Analyst $122,356 $130,468

Skill Standardization Process

To ensure consistent analysis of technical skills, I:

Renamed column headers to lowercase for standardization

Removed duplicates and unified naming (e.g., “python” vs “Python”)

Pivoted the skill columns into long format for easier grouping

This process enabled clean comparison of top skills by mapped job roles.

HR Callout

HR Note: These trends offer clear guidance for compensation benchmarking. Be proactive in aligning your salary offers and skill development initiatives with the emerging demand highlighted in this report.

Key Takeaways

  • Data Scientist roles show strong growth with projected averages over $130K — salaries plateau but remain competitive.
  • Machine Learning Engineers have the highest upward momentum, likely reflecting the surge in AI investments.
  • Data Analysts grow more modestly, suggesting room for strategic upskilling or tiered compensation models.
  • Top skills like Python, SQL, and Tableau dominate across all roles, but emerging tech like TensorFlow and Docker are rising among engineers.

Conclusion

This HR-focused analysis reveals that compensation trends in data roles are not only growing — they are diverging. Specialized talent demands higher pay, and the tools professionals use often predict their market value. By aligning hiring strategies with this data, HR teams can offer competitive packages, develop smarter pipelines, and stay ahead of the analytics talent war.

Data Sources

This report draws from two datasets that were part of a larger SQL-based pipeline project: - A cleaned salary dataset from 2019–2025 (ds_salaries_2025.csv) - The 2023 Stack Overflow Developer Survey (sods_2023.csv)

Both were preprocessed in Python and imported into a relational schema. The skills data were standardized and joined to mapped job titles based on this schema.

Limitations & Next Steps

  • The dataset is limited to just seven years of salary data, which restricts the reliability of long-term forecasting.
  • Skills are self-reported and may not reflect real-time job market demand.
  • Future improvements could include clustering similar roles, integrating job postings from APIs, or applying ensemble models for salary predictions.
ggsave("plots/data_scientist_salary_forecast.png", plot = plot_ds)
ggsave("plots/ml_engineer_salary_forecast.png", plot = plot_ml)
ggsave("plots/data_analyst_salary_forecast.png", plot = plot_analyst)
ggsave("plots/top_skills_by_role.png", plot = skills_plot)