If you have heard that the tech industry pays high salaries, or you are considering a career in tech, you may be wondering: what are the strongest predictors of a tech worker’s income? For this project, I examined global data science salaries from a dataset originally compiled by aijobs.net, an organization that documents salary and compensation trends in the tech industry. I am particularly interested in how remote work, experience level, and work year affect pay. This topic matters to me because I will be joining the tech workforce, and knowing how these factors relate to pay will help with my career planning.
Data Source: aijobs.net
Image Source: https://www.geeksforgeeks.org/ai-ml-ds/
knitr::include_graphics("~/Documents/Data 101:110/AIMLDS.png")
setwd("~/Documents/Data 101:110")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the data
df <- read_csv("salaries.csv")
## Rows: 88584 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): experience_level, employment_type, job_title, salary_currency, empl...
## dbl (4): work_year, salary, salary_in_usd, remote_ratio
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview dataset
head(df)
## # A tibble: 6 × 11
## work_year experience_level employment_type job_title salary salary_currency
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 2025 MI FT Customer Su… 57000 EUR
## 2 2025 SE FT Engineer 165000 USD
## 3 2025 SE FT Engineer 109000 USD
## 4 2025 SE FT Applied Sci… 294000 USD
## 5 2025 SE FT Applied Sci… 137600 USD
## 6 2025 EN FT Data Analyst 82000 USD
## # ℹ 5 more variables: salary_in_usd <dbl>, employee_residence <chr>,
## # remote_ratio <dbl>, company_location <chr>, company_size <chr>
max(df$salary_in_usd)
## [1] 8e+05
Key Variables: experience_level, remote_ratio, job_title, salary_in_usd, work_year
model <- lm(salary_in_usd ~ remote_ratio + experience_level + work_year, data = df)
summary(model)
##
## Call:
## lm(formula = salary_in_usd ~ remote_ratio + experience_level +
## work_year, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176908 -47448 -9987 35366 686229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.004e+06 7.663e+05 -11.75 <2e-16 ***
## remote_ratio -1.509e+02 5.762e+00 -26.19 <2e-16 ***
## experience_levelEX 1.009e+05 1.777e+03 56.77 <2e-16 ***
## experience_levelMI 4.258e+04 8.676e+02 49.08 <2e-16 ***
## experience_levelSE 7.424e+04 8.178e+02 90.78 <2e-16 ***
## work_year 4.499e+03 3.786e+02 11.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69310 on 88578 degrees of freedom
## Multiple R-squared: 0.1117, Adjusted R-squared: 0.1116
## F-statistic: 2227 on 5 and 88578 DF, p-value: < 2.2e-16
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
$$\text{salary\_in\_usd} = \beta_0 + \beta_1(\text{remote\_ratio}) + \beta_2(\text{experience\_level}) + \beta_3(\text{work\_year}) + \epsilon$$
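Because experience_level is categorical, R does not fit a single slope for it. The summary output lists separate coefficients for EX, MI, and SE, with EN as the omitted baseline. Under that reading, the same model can be written out with indicator variables. The subscripted names below are my own notation, not something printed by R:

```latex
\text{salary\_in\_usd}
  = \beta_0
  + \beta_1\,\text{remote\_ratio}
  + \beta_{2,\mathrm{MI}}\,\mathbb{1}[\text{level} = \mathrm{MI}]
  + \beta_{2,\mathrm{SE}}\,\mathbb{1}[\text{level} = \mathrm{SE}]
  + \beta_{2,\mathrm{EX}}\,\mathbb{1}[\text{level} = \mathrm{EX}]
  + \beta_3\,\text{work\_year}
  + \epsilon
```

Each indicator equals 1 when the row has that experience level and 0 otherwise, so every experience coefficient is a salary difference relative to entry-level.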
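Adjusted R²: 0.1116 (the model explains about 11% of the variation in salary)
Key Findings:
-Experience Level (EX, MI, SE): Strong positive effect on salary. Relative to entry-level (EN, the baseline), predicted salary is about $42,580 higher for mid-level, $74,240 higher for senior, and $100,900 higher for executive roles
-Remote Ratio: Small negative effect, about $151 less per percentage point of remote work (roughly $15,090 less for a fully remote role than a fully on-site one)
-Work Year: Modest increase per year (approx. $4,499 per year)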
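To make the coefficients concrete, here is a small worked example that plugs them into the regression equation by hand. The numbers are copied from the summary output in their rounded form, so the results are approximate (the intercept alone is rounded to the nearest thousand or so). The helper `predict_salary()` is a hypothetical name introduced only for this illustration.

```r
# Coefficients copied (rounded) from the summary(model) output above
coefs <- c(
  intercept    = -9.004e6,
  remote_ratio = -150.9,
  EX           = 1.009e5,
  MI           = 4.258e4,
  SE           = 7.424e4,
  work_year    = 4499
)

# Hypothetical helper: predicted salary for one profile.
# level is one of "EN" (baseline, no added term), "MI", "SE", "EX".
predict_salary <- function(level, remote_ratio, work_year) {
  level_effect <- if (level == "EN") 0 else coefs[[level]]
  coefs[["intercept"]] +
    coefs[["remote_ratio"]] * remote_ratio +
    level_effect +
    coefs[["work_year"]] * work_year
}

# Gap between a fully remote and a fully on-site role, all else equal
remote_gap <- coefs[["remote_ratio"]] * 100

# Predicted salary for a senior, fully on-site role in 2025
senior_onsite_2025 <- predict_salary("SE", 0, 2025)

remote_gap
senior_onsite_2025
```

With the fitted model object available, `predict(model, newdata = ...)` would give the same kind of estimate without the rounding error.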
df <- df |>
mutate(experience_level = recode(experience_level,
"EN" = "Entry-Level",
"MI" = "Mid-Level",
"SE" = "Senior-Level",
"EX" = "Executive"))
p1 <- ggplot(df, aes(x = experience_level, y = salary_in_usd, fill = experience_level)) +
geom_boxplot() +
labs(title = "Salary Distribution by Experience Level",
x = "Experience Level",
y = "Salary (USD)",
caption = "Data source: aijobs.net",
fill = "Level") +
theme_minimal() +
scale_fill_brewer(palette = "Dark2") +
theme(legend.position = "bottom")
p1
Salaries increase with experience: Executive roles have the highest median salaries, while Entry-Level roles have the lowest.
The Executive level shows a wider salary range and more high-end outliers, indicating greater variation at the top.
Overlapping ranges at the Mid and Senior levels suggest some salary inconsistencies, possibly due to job function or industry differences.
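The claims above can be checked numerically with a grouped summary. The sketch below uses a small made-up data frame (`toy`) in place of df, so the salary values are purely illustrative. It also sets the factor levels explicitly so that groups appear in career order rather than alphabetically, which is an assumption about the preferred ordering, not something the original plot does.

```r
library(dplyr)

# Hypothetical toy data standing in for df (values are made up)
toy <- data.frame(
  experience_level = c("Entry-Level", "Entry-Level", "Mid-Level", "Mid-Level",
                       "Senior-Level", "Senior-Level", "Executive", "Executive"),
  salary_in_usd = c(60000, 80000, 100000, 120000, 150000, 170000, 190000, 230000)
)

# Career order for the groups (the default would be alphabetical)
level_order <- c("Entry-Level", "Mid-Level", "Senior-Level", "Executive")

salary_by_level <- toy |>
  mutate(experience_level = factor(experience_level, levels = level_order)) |>
  group_by(experience_level) |>
  summarise(
    n = n(),                               # rows per level
    median_salary = median(salary_in_usd), # centre of each box
    iqr_salary = IQR(salary_in_usd),       # height of each box
    .groups = "drop"
  )

salary_by_level
```

Applying the same `factor(..., levels = level_order)` step to df before plotting would put the boxplot's x-axis in the same order.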
p2 <- ggplot(df, aes(x = factor(remote_ratio), y = salary_in_usd, fill = factor(remote_ratio))) +
geom_violin(trim = FALSE) +
labs(title = "Remote Ratio and Its Impact on Salary",
x = "Remote Work Percentage",
y = "Salary (USD)",
caption = "Data source: aijobs.net",
fill = "Remote %") +
theme_classic() +
scale_fill_manual(values = c("0" = "#1f78b4", "50" = "#33a02c", "100" = "#e31a1c")) +
theme(legend.position = "bottom")
p2
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(
data = df,
x = ~remote_ratio,
y = ~salary_in_usd,
type = 'scatter',
mode = 'markers',
color = ~experience_level,
text = ~paste("Title:", job_title),
marker = list(size = 8)
) |>
layout(
title = "Salary vs Remote Ratio by Experience Level",
xaxis = list(title = "Remote Work Ratio (%)"),
yaxis = list(title = "Salary in USD"),
legend = list(title = list(text = "<b>Experience</b>"))
)
Fully remote (100%) roles tend to pay more than hybrid (50%) roles, with a higher median and a wider spread.
Fully remote roles also show more outliers on the high end, likely tied to global hiring and competition for remote tech talent.
This suggests that remote flexibility may provide access to higher-paying jobs, supporting the view of remote work as a competitive benefit. The plots also show that the higher salaries sit at either end of the scale: roles that are 100% or 0% remote pay more than hybrid ones.
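A pattern where both 0% and 100% remote roles pay more than 50% roles cannot be captured by a single linear slope for remote_ratio. One possible way to check this, shown here only as a sketch on made-up data (`toy`, `fit_linear`, and `fit_factor` are hypothetical names), is to compare a model with remote_ratio as a number against one that treats it as a three-level factor.

```r
# Hypothetical toy data (made-up values) with a dip at 50% remote
toy <- data.frame(
  remote_ratio  = c(0, 0, 50, 50, 100, 100),
  salary_in_usd = c(150000, 160000, 90000, 100000, 140000, 150000)
)

# Numeric term: one slope is forced across 0, 50, and 100
fit_linear <- lm(salary_in_usd ~ remote_ratio, data = toy)

# Factor term: each remote level gets its own mean (0 is the baseline)
fit_factor <- lm(salary_in_usd ~ factor(remote_ratio), data = toy)

coef(fit_linear)  # a single small negative slope
coef(fit_factor)  # separate differences for 50 and 100 versus 0
```

In this toy example the linear fit reports only a small negative slope, while the factor fit shows that the 50% group is far below both ends. Whether the real data behave the same way would need to be checked by refitting the model on df.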
The dataset offers an opportunity to track technology salaries from all over the world, with fields for job title, salary, experience level, company size, remote ratio, and employment type. I filtered the data because I was only interested in full-time roles, and I converted experience level, remote ratio, employment type, and salary into factors. For the data wrangling I used dplyr functions such as filter(), mutate(), and select(). I chose this dataset because I want to pursue a career in tech and thought it would be useful to see how experience and the opportunity to work remotely affect salaries.
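The wrangling steps described above could look roughly like the sketch below. It runs on a small made-up data frame (`toy`) that borrows column names from salaries.csv. The row values, the `company_size` codes, and the choice of which columns to keep are assumptions for illustration, not the exact code used for this project.

```r
library(dplyr)

# Hypothetical toy rows using column names from salaries.csv (values made up)
toy <- data.frame(
  work_year        = c(2024, 2025, 2025, 2025),
  experience_level = c("EN", "SE", "MI", "SE"),
  employment_type  = c("FT", "FT", "PT", "FT"),
  job_title        = c("Data Analyst", "Engineer", "Data Analyst", "Engineer"),
  salary_in_usd    = c(82000, 165000, 40000, 109000),
  remote_ratio     = c(0, 100, 50, 0),
  company_size     = c("M", "M", "S", "L")
)

full_time <- toy |>
  # keep only full-time roles
  filter(employment_type == "FT") |>
  # treat the categorical columns as factors
  mutate(across(c(experience_level, remote_ratio, employment_type), as.factor)) |>
  # keep the columns used in the analysis
  select(work_year, experience_level, employment_type, job_title,
         salary_in_usd, remote_ratio)

str(full_time)
```

Note that converting remote_ratio to a factor before modelling changes how lm() treats it, so the conversion would need to happen after any model that uses it as a number.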
The boxplot shows the income gap between experience levels clearly, and the scatterplot suggests that high-remote jobs tend to be better paid. I was surprised by how large the salary jump from senior to executive was. I wanted to create a geographical heatmap of salaries by country, but the location data was too inconsistent to include.