Source: DataQuest. (2019). What Is Data Science?
This project analyzes salary trends in the data science field using the Data Science Salaries 2025 dataset. The dataset contains 93,597 observations and 11 variables, including work year, experience level, employment type, job title, company location, and salary.
Quantitative variables: ‘salary_in_usd’, ‘remote_ratio’
Categorical variables: ‘experience_level’, ‘employment_type’, ‘company_location’, ‘job_title’
Data Source: The file does not name its original source, and no README or documentation was provided. I assume the data was compiled by Saurabh Badole through web scraping of job boards or from self-reported surveys.
As a Latina immigrant and aspiring data scientist, understanding how factors like experience, location, and work arrangement influence salary is personally and professionally relevant. This project aims to uncover meaningful patterns in salary data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(readr)
#load dataset from local CSV file
setwd("~/Desktop/DATA/Data Visualization 110/Final Project")
data <- read_csv("DataScience_salaries_2025.csv")
## Rows: 93597 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): experience_level, employment_type, job_title, salary_currency, empl...
## dbl (4): work_year, salary, salary_in_usd, remote_ratio
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
## work_year experience_level employment_type job_title salary salary_currency
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 2025 MI FT Research Sc… 208000 USD
## 2 2025 MI FT Research Sc… 147000 USD
## 3 2025 SE FT Research Sc… 173000 USD
## 4 2025 SE FT Research Sc… 117000 USD
## 5 2025 MI FT AI Engineer 100000 USD
## 6 2025 MI FT AI Engineer 80000 USD
## # ℹ 5 more variables: salary_in_usd <dbl>, employee_residence <chr>,
## # remote_ratio <dbl>, company_location <chr>, company_size <chr>
#standardize column names and format data types
data <- data %>%
  #lowercase all column names
  rename_with(tolower) %>%
  #convert categorical columns to factors and keep remote_ratio numeric
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    job_title = as.factor(job_title),
    company_location = as.factor(company_location),
    company_size = as.factor(company_size),
    remote_ratio = as.numeric(remote_ratio)
  )
#check for missing values
colSums(is.na(data))
## work_year experience_level employment_type job_title
## 0 0 0 0
## salary salary_currency salary_in_usd employee_residence
## 0 0 0 0
## remote_ratio company_location company_size
## 0 0 0
Salary vs Experience, Remote Ratio, and Company Size
model <- lm(salary_in_usd ~ experience_level + remote_ratio + company_size, data = data)
summary(model)
##
## Call:
## lm(formula = salary_in_usd ~ experience_level + remote_ratio +
## company_size, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189100 -47495 -10470 34968 690354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104891.795 1486.528 70.562 <2e-16 ***
## experience_levelEX 101318.588 1733.939 58.433 <2e-16 ***
## experience_levelMI 42975.083 847.217 50.725 <2e-16 ***
## experience_levelSE 73730.390 798.167 92.375 <2e-16 ***
## remote_ratio -152.868 5.576 -27.418 <2e-16 ***
## company_sizeM -2110.248 1333.674 -1.582 0.114
## company_sizeS -50180.916 4959.216 -10.119 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69490 on 93590 degrees of freedom
## Multiple R-squared: 0.1099, Adjusted R-squared: 0.1099
## F-statistic: 1926 on 6 and 93590 DF, p-value: < 2.2e-16
#diagnostic plot
par(mfrow = c(2,2))
plot(model)
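Model Equation: \[ \text{Salary (USD)} = \beta_0 + \beta_1(\text{Experience Level}) + \beta_2(\text{Remote Ratio}) + \beta_3(\text{Company Size}) + \varepsilon \]
Experience level and company size are categorical, so each enters as a set of indicator terms measured against a reference level that is absorbed into the intercept \(\beta_0\).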
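To make the encoding of the categorical predictors concrete, here is a minimal sketch of the design matrix that lm() builds for this formula. The `toy` data frame and its three rows are made up for illustration and are not taken from the salary dataset.

```r
# Toy data: illustrative values only, not rows from the salary dataset
toy <- data.frame(
  experience_level = factor(c("EN", "MI", "SE")),
  remote_ratio     = c(0, 50, 100),
  company_size     = factor(c("L", "M", "S"))
)

# model.matrix() builds the same kind of design matrix lm() uses internally:
# one indicator column per non-reference factor level, plus the intercept
X <- model.matrix(~ experience_level + remote_ratio + company_size, data = toy)
colnames(X)
X
```

The first level of each factor (alphabetically, unless releveled) gets no column of its own, which is why the coefficient table lists fewer experience and company-size terms than there are levels.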
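Conclusion:
The adjusted R-squared is approximately 0.11, so the model explains only about 11% of the variation in salary and has modest predictive power. Experience level has the largest coefficients (all p < 0.001): relative to the reference level, executive-level (EX) roles are associated with about $101,000 more in salary, senior-level (SE) with about $74,000 more, and mid-level (MI) with about $43,000 more. Each additional percentage point of remote ratio is associated with about $153 less, and small companies (S) pay about $50,000 less than the reference company size, while the difference for medium companies (M) is not statistically significant (p = 0.114).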
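As a worked example of reading the coefficient table, the sketch below computes a fitted salary by hand. The coefficient values are copied from the summary(model) output above; the profile (senior-level, fully remote, medium company) and the object names `coefs` and `predicted` are hypothetical choices for illustration.

```r
# Coefficients copied from the summary(model) output
coefs <- c(
  intercept = 104891.795,
  exp_EX    = 101318.588,
  exp_MI    = 42975.083,
  exp_SE    = 73730.390,
  remote    = -152.868,
  size_M    = -2110.248,
  size_S    = -50180.916
)

# Hypothetical profile: senior-level (SE), remote_ratio = 100, medium company (M)
predicted <- coefs[["intercept"]] +
  coefs[["exp_SE"]] +
  coefs[["remote"]] * 100 +
  coefs[["size_M"]]

round(predicted)
```

With the fitted model object available, predict(model, newdata = ...) gives the same kind of value without manual arithmetic. Given the residual standard error of about $69,490, any single prediction of this sort is very imprecise.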
ggplot(data, aes(x = experience_level, y = salary_in_usd, fill = employment_type)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Salary by Experience and Employment Type",
       x = "Experience Level", y = "Salary (USD)",
       fill = "Employment Type",
       caption = "Data source: Data Science Salaries 2025") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
Executive-level and full-time employees tend to earn more. Freelancers show high variance in salaries.
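A statement about salary variance can be backed up with spread statistics per employment type rather than read off the boxplot alone. The following is a minimal sketch on made-up numbers; `toy` and `spread_by_type` are hypothetical names, and on the real data the same pipeline would start from `data` instead.

```r
library(dplyr)

# Toy salaries: illustrative values only, not drawn from the dataset
toy <- data.frame(
  employment_type = c("FT", "FT", "FT", "FL", "FL", "FL"),
  salary_in_usd   = c(100000, 110000, 120000, 40000, 110000, 180000)
)

# Summarize the center and spread of salary within each employment type
spread_by_type <- toy %>%
  group_by(employment_type) %>%
  summarize(
    n             = n(),
    median_salary = median(salary_in_usd),
    iqr_salary    = IQR(salary_in_usd),
    sd_salary     = sd(salary_in_usd),
    .groups = "drop"
  )

spread_by_type
```

Reporting the group size `n` alongside the spread matters here, because a group with few observations can show a large or small spread by chance.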
#country code to full name
country_names <- c(
  "US" = "United States",
  "IN" = "India",
  "GB" = "United Kingdom",
  "CA" = "Canada",
  "DE" = "Germany"
)
#prepare data
bar_data <- data %>%
  mutate(remote_ratio = factor(remote_ratio),
         country_full = recode(company_location, !!!country_names)) %>%
  group_by(country_full, remote_ratio) %>%
  summarize(mean_salary = mean(salary_in_usd), .groups = "drop") %>%
  filter(country_full %in% country_names)
overall_median <- median(data$salary_in_usd) #calculate overall median
bar_plot <- ggplot(bar_data, aes(x = country_full, y = mean_salary, fill = remote_ratio)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  #use linewidth (size is deprecated for lines since ggplot2 3.4.0)
  geom_hline(yintercept = overall_median, color = "black", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 1, y = overall_median + 5000,
           label = paste0("Median Salary: $", round(overall_median, 0)),
           color = "black", hjust = 0, size = 3.5) +
  labs(
    title = "Mean Salary by Country and Remote Work Level",
    subtitle = "Black dashed line indicates overall median salary across dataset",
    x = "Country",
    y = "Mean Salary (USD)",
    fill = "Remote Ratio (%)",
    caption = "Data source: Data Science Salaries 2025"
  ) +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
bar_plot
Specialized roles such as AI Architects and Machine Learning Engineers top the pay scale.
#define the trending job titles
trending_jobs <- c("AI/ML Engineer", "Data Engineer", "Data Scientist", "Data Analyst",
                   "Data Architect", "Business Intelligence Analyst", "Machine Learning Scientist",
                   "Big Data Engineer", "Statistician", "Data Storyteller")
#filter the dataset for these job titles
filtered_data <- data %>%
  filter(job_title %in% trending_jobs)
#calculate median salary by year and job title
trending <- filtered_data %>%
  group_by(work_year, job_title) %>%
  summarize(median_salary = median(salary_in_usd), .groups = 'drop')
#create an interactive plot
plot_ly(trending, x = ~work_year, y = ~median_salary, type = 'scatter', mode = 'lines+markers',
        color = ~job_title, colors = 'Dark2') %>%
  layout(title = "Trending Data Science Job Titles Over Time (by Median Salary)",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Median Salary (USD)"),
         legend = list(title = list(text = "Job Title")))
The top trending job each year fluctuates, but all show an upward salary trend.
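Whether every title really trends upward can be checked numerically instead of by eye. The sketch below computes year-over-year changes in median salary per job title on made-up values; `toy_trend` and `yoy` are hypothetical names, and on the real data `toy_trend` would be replaced by the `trending` table built above.

```r
library(dplyr)

# Toy medians: illustrative values only, not results from the dataset
toy_trend <- data.frame(
  work_year     = c(2023, 2024, 2025, 2023, 2024, 2025),
  job_title     = c("Data Analyst", "Data Analyst", "Data Analyst",
                    "Data Engineer", "Data Engineer", "Data Engineer"),
  median_salary = c(100000, 105000, 103000, 140000, 145000, 150000)
)

# For each title, compare each year's median with the previous year's
yoy <- toy_trend %>%
  arrange(job_title, work_year) %>%
  group_by(job_title) %>%
  mutate(change = median_salary - dplyr::lag(median_salary)) %>%
  summarize(always_rising = all(change > 0, na.rm = TRUE), .groups = "drop")

yoy
```

In this toy example only one of the two titles rises every year, which shows how the check separates a consistent upward trend from one that merely ends higher than it started.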
According to the U.S. Bureau of Labor Statistics (2023), data science roles are projected to grow by 36% over the next decade. This growth is largely driven by the expansion of remote work opportunities and the increasing need for specialized skills in areas such as artificial intelligence and machine learning. As companies continue to invest in digital transformation and automation, roles that involve advanced data capabilities are becoming more critical—and more lucrative. These trends contribute to a competitive job market in which specialization and adaptability are key to higher compensation and career advancement.
This project analyzed how factors such as experience level, employment type, and job title influence salaries within the data science field. The findings show that experience plays a significant role in determining salary, with executive-level professionals earning far more than their entry-level counterparts. Full-time positions tend to offer the most financial stability, and roles like “AI Architect” and “Machine Learning Scientist” consistently rank among the highest-paying job titles. While the dataset provided valuable insights, one of the main challenges was the lack of geocoded location data, which limited the ability to explore global salary disparities through mapping. In future work, incorporating spatial data could reveal how location affects compensation in the data science industry.
I wished to explore more global mapping, but location data wasn’t geocoded. Future work could include mapping salaries by region.