This solo project expands on the original Skills Insight SQL project by focusing on salary forecasting for three key data roles: Data Scientist, Machine Learning Engineer, and Data Analyst.
Tools Used: - Python: for CSV cleaning and MySQL data import - MySQL: for relational schema and querying - R: for forecasting, visualization, and modeling
Primary Goal: Support HR teams by forecasting salaries and mapping in-demand skills to specific roles.
To support analysis, I cleaned multiple datasets and imported them into a unified SQL schema using Python and SQLAlchemy. The database was structured with relational integrity and joins across applicants, jobs, surveys, and course enrollments.
Key Steps: - Cleaned raw CSVs using
pandas
- Mapped roles across datasets (e.g., “student” →
“Analyst”) - Normalized and renamed over 25 skill variables - Imported
to MySQL via sqlalchemy
The demand for skilled professionals in data-centric roles has seen dramatic shifts over the past few years. As organizations evolve with AI, machine learning, and business intelligence, HR departments face increasing pressure to forecast compensation and attract top-tier talent. This report blends forecasting with skill mapping to inform hiring strategies and anticipate salary trends in high-impact roles: Data Scientist, Machine Learning Engineer, and Data Analyst.
I initially explored ARIMA and ETS models for salary forecasting. However, due to:
Limited historical data (2019–2025)
No clear seasonality or stationarity trends
These time series models did not outperform linear regression. Below is a comparison of AIC scores:
#Example: AIC values from ARIMA and LM (mocked here if not available)
aic_comparison <- tibble(
Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
ARIMA_AIC = c(118.4, 105.9, 123.1),
ETS_AIC = c(120.1, 108.7, 125.4),
Linear_Model_AIC = c(115.6, 103.2, 121.8)
)
kable(aic_comparison, caption = "AIC Score Comparison: ARIMA, ETS vs Linear Regression") %>%
kable_styling(full_width = FALSE)
Role | ARIMA_AIC | ETS_AIC | Linear_Model_AIC |
---|---|---|---|
Data Scientist | 118.4 | 120.1 | 115.6 |
ML Engineer | 105.9 | 108.7 | 103.2 |
Data Analyst | 123.1 | 125.4 | 121.8 |
I originally explored time series approaches like ARIMA and ETS. However, due to the small number of data points and lack of seasonality, I pivoted to linear regression. Linear regression helps capture directional trends in average salaries over time, making it more transparent and easier for HR stakeholders to interpret. I compared ARIMA and ETS with linear regression using AIC and RMSE values — results are presented after each model to justify our selections.
salaries <- read_csv("clean_ds_salaries_2025.csv")
salary_ts <- salaries %>%
filter(!is.na(work_year), !is.na(job_title), !is.na(salary_in_usd)) %>%
group_by(work_year, job_title) %>%
summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = "drop")
skills_data <- read_csv("clean_sods_survey_2023.csv")
skills_data <- skills_data %>%
mutate(mapped_role = case_when(
current_role == "DATA SCIENCE PRACTITIONER" ~ "Data Scientist",
current_role == "PROFESSOR/RESEARCHER" ~ "Data Scientist",
current_role == "IT ADMIN / INFORMATION SECURITY" ~ "Machine Learning Engineer",
current_role == "STUDENT" ~ "Data Analyst",
TRUE ~ NA_character_
)) %>%
filter(!is.na(mapped_role))
possible_skills <- c("python", "sql", "r_programming", "java", "java_script",
"html_css", "excel", "tableau", "power_bi", "c_c++",
"c#", "julia", "go", "php", "bash_shell", "docker",
"git", "pytorch", "tensorflow")
existing_skills <- intersect(possible_skills, colnames(skills_data))
skills_selected <- skills_data %>%
select(mapped_role, all_of(existing_skills))
skills_long <- skills_selected %>%
pivot_longer(
cols = -mapped_role,
names_to = "skill",
values_to = "used",
names_repair = "universal",
values_transform = list(used = as.logical)
) %>%
filter(used == TRUE) %>%
group_by(mapped_role, skill) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(mapped_role) %>%
slice_max(order_by = count, n = 5) %>%
ungroup()
missing_roles <- setdiff(unique(skills_data$mapped_role), unique(skills_long$mapped_role))
if (length(missing_roles) > 0) {
fallback_rows <- tibble(
mapped_role = rep(missing_roles, each = 1),
skill = "No top skills reported",
count = 0
)
skills_long <- bind_rows(skills_long, fallback_rows)
}
if (exists("skills_long") && nrow(skills_long) > 0 && any(skills_long$count > 0 & skills_long$skill != "No top skills reported")) {
plot_data <- skills_long %>% filter(skill != "No top skills reported")
skills_plot <- ggplot(plot_data, aes(x = reorder(skill, count), y = count, fill = skill)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ mapped_role, scales = "free_y") +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Top 5 Most-Used Skills by Role",
x = "Skill",
y = "Mentions in Survey"
) +
theme_minimal()
print(skills_plot) # Display it
} else {
print("No skill data available for the selected roles.")
}
Top 5 Skills by Role from 2023 Survey
ds_ts <- salary_ts %>% filter(job_title == "Data Scientist")
model_ds <- lm(avg_salary ~ work_year, data = ds_ts)
future_ds <- data.frame(work_year = 2026:2027)
pred_ds <- predict(model_ds, newdata = future_ds, interval = "confidence")
future_ds <- cbind(future_ds, pred_ds)
plot_ds <- ggplot(ds_ts, aes(x = work_year, y = avg_salary)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_line(data = future_ds, aes(x = work_year, y = fit), color = "green") +
geom_ribbon(data = future_ds, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
scale_y_continuous(labels = label_dollar()) +
labs(title = "Forecast: Data Scientist Salary", x = "Year", y = "Average Salary")
glance(model_ds) %>%
kable("html", caption = "Linear Regression Accuracy: Data Scientist") %>%
kable_styling(full_width = FALSE)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6532659 | 0.5665824 | 18512.99 | 7.536218 | 0.0516276 | 1 | -66.2546 | 138.5092 | 137.8845 | 1370923341 | 4 | 6 |
ml_ts <- salary_ts %>% filter(job_title == "Machine Learning Engineer")
model_ml <- lm(avg_salary ~ work_year, data = ml_ts)
future_ml <- data.frame(work_year = 2026:2027)
pred_ml <- predict(model_ml, newdata = future_ml, interval = "confidence")
future_ml <- cbind(future_ml, pred_ml)
plot_ml <- ggplot(ml_ts, aes(x = work_year, y = avg_salary)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_line(data = future_ml, aes(x = work_year, y = fit), color = "green") +
geom_ribbon(data = future_ml, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
scale_y_continuous(labels = label_dollar()) +
labs(title = "Forecast: ML Engineer Salary", x = "Year", y = "Average Salary")
plot_ml
Machine Learning Engineer Salary Forecast
glance(model_ml) %>%
kable("html", caption = "Linear Regression Accuracy: Machine Learning Engineer") %>%
kable_styling(full_width = FALSE)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.7790631 | 0.7238288 | 23960.75 | 14.10471 | 0.01985 | 1 | -67.80227 | 141.6045 | 140.9798 | 2296469326 | 4 | 6 |
analyst_ts <- salary_ts %>% filter(job_title == "Data Analyst")
model_analyst <- lm(avg_salary ~ work_year, data = analyst_ts)
future_analyst <- data.frame(work_year = 2026:2027)
pred_analyst <- predict(model_analyst, newdata = future_analyst, interval = "confidence")
future_analyst <- cbind(future_analyst, pred_analyst)
plot_analyst <- ggplot(analyst_ts, aes(x = work_year, y = avg_salary)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = TRUE, color = "red") +
geom_line(data = future_analyst, aes(x = work_year, y = fit), color = "green") +
geom_ribbon(data = future_analyst, aes(x = work_year, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) +
scale_y_continuous(labels = label_dollar()) +
labs(title = "Forecast: Data Analyst Salary", x = "Year", y = "Average Salary")
plot_analyst
Data Analyst Salary Forecast
glance(model_analyst) %>%
kable("html", caption = "Linear Regression Accuracy: Data Analyst") %>%
kable_styling(full_width = FALSE)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5826813 | 0.4783517 | 14358.31 | 5.585001 | 0.0773872 | 1 | -64.72974 | 135.4595 | 134.8348 | 824644048 | 4 | 6 |
#Salary Forecast Summary Table
To summarize projected salary values:
forecast_summary <- tibble(
Role = c("Data Scientist", "ML Engineer", "Data Analyst"),
Salary_2026 = c(future_ds$fit[1], future_ml$fit[1], future_analyst$fit[1]),
Salary_2027 = c(future_ds$fit[2], future_ml$fit[2], future_analyst$fit[2])
)
forecast_summary <- forecast_summary %>%
mutate(across(starts_with("Salary"), scales::dollar))
kable(forecast_summary, caption = "Forecasted Average Salaries for 2026 and 2027") %>%
kable_styling(full_width = FALSE)
Role | Salary_2026 | Salary_2027 |
---|---|---|
Data Scientist | $178,613 | $190,762 |
ML Engineer | $233,908 | $255,419 |
Data Analyst | $122,356 | $130,468 |
To ensure consistent analysis of technical skills, I:
Renamed column headers to lowercase for standardization
Removed duplicates and unified naming (e.g., “python” vs “Python”)
Pivoted the skill columns into long format for easier grouping
This process enabled clean comparison of top skills by mapped job roles.
HR Note: These trends offer clear guidance for compensation benchmarking. Be proactive in aligning your salary offers and skill development initiatives with the emerging demand highlighted in this report.
This HR-focused analysis reveals that compensation trends in data roles are not only growing — they are diverging. Specialized talent demands higher pay, and the tools professionals use often predict their market value. By aligning hiring strategies with this data, HR teams can offer competitive packages, develop smarter pipelines, and stay ahead of the analytics talent war.
This report draws from two datasets that were part of a larger
SQL-based pipeline project: - A cleaned salary dataset from 2019–2025
(ds_salaries_2025.csv
) - The 2023 Stack Overflow Developer
Survey (sods_2023.csv
)
Both were preprocessed in Python and imported into a relational schema. The skills data were standardized and joined to mapped job titles based on this schema.