Data Science Jobs Illustration
Source: DataQuest. (2019). What Is Data Science?

Introduction

Part 1: Topic, Variables, Methodology, and Rationale

This project analyzes salary trends in the data science field using the Data Science Salaries 2025 dataset. The dataset contains 93,597 observations and 11 variables, including work year, experience level, employment type, job title, company location, company size, remote ratio, and salary.

Quantitative variables: ‘salary_in_usd’, ‘remote_ratio’

Categorical variables: ‘experience_level’, ‘employment_type’, ‘company_location’, ‘company_size’, ‘job_title’

Data Source: The file does not state its original source, and no README or documentation was provided. I assume Saurabh Badole compiled it by scraping job boards or from self-reported survey data.

As a Latina immigrant and aspiring data scientist, understanding how factors like experience, location, and work arrangement influence salary is personally and professionally relevant. This project aims to uncover meaningful patterns in salary data.

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(readr)

Load Dataset

#load dataset from local CSV file
setwd("~/Desktop/DATA/Data Visualization 110/Final Project")
data <- read_csv("DataScience_salaries_2025.csv")
## Rows: 93597 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): experience_level, employment_type, job_title, salary_currency, empl...
## dbl (4): work_year, salary, salary_in_usd, remote_ratio
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 11
##   work_year experience_level employment_type job_title    salary salary_currency
##       <dbl> <chr>            <chr>           <chr>         <dbl> <chr>          
## 1      2025 MI               FT              Research Sc… 208000 USD            
## 2      2025 MI               FT              Research Sc… 147000 USD            
## 3      2025 SE               FT              Research Sc… 173000 USD            
## 4      2025 SE               FT              Research Sc… 117000 USD            
## 5      2025 MI               FT              AI Engineer  100000 USD            
## 6      2025 MI               FT              AI Engineer   80000 USD            
## # ℹ 5 more variables: salary_in_usd <dbl>, employee_residence <chr>,
## #   remote_ratio <dbl>, company_location <chr>, company_size <chr>

Data Cleaning

#standardize column names to lowercase and convert categorical columns to factors
data <- data %>%
  rename_with(tolower) %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    job_title = as.factor(job_title),
    company_location = as.factor(company_location),
    company_size = as.factor(company_size),
    remote_ratio = as.numeric(remote_ratio)
  )

#check for missing values
colSums(is.na(data))
##          work_year   experience_level    employment_type          job_title 
##                  0                  0                  0                  0 
##             salary    salary_currency      salary_in_usd employee_residence 
##                  0                  0                  0                  0 
##       remote_ratio   company_location       company_size 
##                  0                  0                  0
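
The level codes in `experience_level` and `employment_type` are terse, so a relabeling step can make plot legends easier to read. The sketch below is only an illustration: the code meanings (EN = entry, MI = mid, SE = senior, EX = executive; FT = full-time, PT = part-time, CT = contract, FL = freelance) are an assumption based on how these codes are commonly used, since the CSV itself does not document them, and `relabel_codes` is a hypothetical helper name.

```r
# Assumed meanings of the codes; the CSV itself does not document them.
experience_labels <- c(EN = "Entry", MI = "Mid", SE = "Senior", EX = "Executive")
employment_labels <- c(FT = "Full-time", PT = "Part-time", CT = "Contract", FL = "Freelance")

# Hypothetical helper: turn a vector of codes into a factor with readable
# labels, ordered as given in `labels` (here: career order, not alphabetical).
relabel_codes <- function(x, labels) {
  factor(as.character(x), levels = names(labels), labels = unname(labels))
}

# Small demonstration on a hand-written vector of codes.
demo <- relabel_codes(c("MI", "SE", "EN", "EX"), experience_labels)
print(demo)
```

If this is applied, it is probably safest to do so on a copy of the data used only for plotting, because relabeling the factors would also change the coefficient names in the regression output.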

Linear Regression

Salary vs Experience, Remote Ratio, and Company Size

model <- lm(salary_in_usd ~ experience_level + remote_ratio + company_size, data = data)
summary(model)
## 
## Call:
## lm(formula = salary_in_usd ~ experience_level + remote_ratio + 
##     company_size, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189100  -47495  -10470   34968  690354 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        104891.795   1486.528  70.562   <2e-16 ***
## experience_levelEX 101318.588   1733.939  58.433   <2e-16 ***
## experience_levelMI  42975.083    847.217  50.725   <2e-16 ***
## experience_levelSE  73730.390    798.167  92.375   <2e-16 ***
## remote_ratio         -152.868      5.576 -27.418   <2e-16 ***
## company_sizeM       -2110.248   1333.674  -1.582    0.114    
## company_sizeS      -50180.916   4959.216 -10.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69490 on 93590 degrees of freedom
## Multiple R-squared:  0.1099, Adjusted R-squared:  0.1099 
## F-statistic:  1926 on 6 and 93590 DF,  p-value: < 2.2e-16
#diagnostic plot
par(mfrow = c(2,2))
plot(model)

Model Equation: \[ \text{Salary (USD)} = \beta_0 + \beta_1(\text{Experience Level}) + \beta_2(\text{Remote Ratio}) + \beta_3(\text{Company Size}) + \varepsilon \]
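
Because experience level and company size are categorical, R expands each of them into indicator (dummy) variables. Written out with the rounded estimates from the summary output above, and taking entry level (EN) and large companies (L) as the baselines, which is what the coefficient names imply, the fitted equation is roughly:

\[ \widehat{\text{Salary}} = 104{,}892 + 101{,}319\,\text{EX} + 42{,}975\,\text{MI} + 73{,}730\,\text{SE} - 152.87\,(\text{Remote Ratio}) - 2{,}110\,\text{M} - 50{,}181\,\text{S} \]

Here EX, MI, SE, M, and S each equal 1 when the observation falls in that category and 0 otherwise.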

Conclusion:
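The adjusted R-squared is approximately 0.11, so the model explains only about 11% of the variation in salary and has modest predictive power. Experience level has the largest effect: holding remote ratio and company size constant, executive-level employees earn about $101,000 more than entry-level employees (p < 0.001). Each additional percentage point of remote work is associated with about $153 less in salary, and small companies pay about $50,000 less than large ones, while the difference between medium and large companies is not statistically significant (p = 0.114).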
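
To make the coefficients concrete, the following sketch computes a predicted salary by hand for a single profile. The coefficient values are copied from the summary output above; `predict_salary` is a hypothetical helper written for this illustration, and the baselines (EN for experience, L for company size) are inferred from the coefficient names.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(
  intercept = 104891.795,
  exp_EX    = 101318.588,
  exp_MI    = 42975.083,
  exp_SE    = 73730.390,
  remote    = -152.868,
  size_M    = -2110.248,
  size_S    = -50180.916
)

# Hypothetical helper: predicted salary (USD) for one profile.
# Baseline categories (EN, L) contribute 0 because they are absorbed
# into the intercept.
predict_salary <- function(experience = "EN", remote_ratio = 0, company_size = "L") {
  exp_term <- switch(experience,
    EN = 0,
    EX = coefs[["exp_EX"]],
    MI = coefs[["exp_MI"]],
    SE = coefs[["exp_SE"]],
    stop("unknown experience level: ", experience)
  )
  size_term <- switch(company_size,
    L = 0,
    M = coefs[["size_M"]],
    S = coefs[["size_S"]],
    stop("unknown company size: ", company_size)
  )
  coefs[["intercept"]] + exp_term + coefs[["remote"]] * remote_ratio + size_term
}

# Senior employee, fully remote (100), medium-sized company.
senior_remote_medium <- predict_salary("SE", 100, "M")
print(senior_remote_medium)  # 161225.1
```

With the fitted object available, `predict(model, newdata = ...)` should give the same number up to rounding of the printed coefficients; the manual version just shows where each term comes from.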

Visualization 1: Salary Distribution by Experience and Employment Type

ggplot(data, aes(x = experience_level, y = salary_in_usd, fill = employment_type)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Salary by Experience and Employment Type",
       x = "Experience Level", y = "Salary (USD)",
       fill = "Employment Type",
       caption = "Data source: Data Science Salaries 2025") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

Executive-level and full-time employees tend to earn more. Freelancers show high variance in salaries.
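
A numeric summary can back up what the boxplot suggests. The sketch below computes the group size, median, and interquartile range for each experience and employment type combination. `summarise_spread` is a hypothetical helper, and the `toy` rows are invented only so the example runs on its own; they are not results from the dataset.

```r
library(dplyr)

# Invented rows so the sketch is self-contained; not real results.
toy <- data.frame(
  experience_level = c("EN", "EN", "EN", "SE", "SE", "SE"),
  employment_type  = c("FT", "FT", "FT", "FL", "FL", "FL"),
  salary_in_usd    = c(60000, 70000, 80000, 50000, 120000, 250000)
)

# Hypothetical helper: group size, median, and IQR of salary per
# experience level and employment type.
summarise_spread <- function(df) {
  df %>%
    group_by(experience_level, employment_type) %>%
    summarise(
      n             = n(),
      median_salary = median(salary_in_usd),
      iqr_salary    = IQR(salary_in_usd),
      .groups       = "drop"
    )
}

spread <- summarise_spread(toy)
print(spread)
```

On the real data this would be called as `summarise_spread(data)`. The `n` column is worth checking before reading too much into a box, since groups with only a handful of rows can look far more variable than they are.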

Visualization 2: Salary by Company Location and Remote Ratio

#country code to full name
country_names <- c(
  "US" = "United States",
  "IN" = "India",
  "GB" = "United Kingdom",
  "CA" = "Canada",
  "DE" = "Germany"
)

#prepare data
bar_data <- data %>%
  mutate(remote_ratio = factor(remote_ratio),
         country_full = recode(company_location, !!!country_names)) %>%
  group_by(country_full, remote_ratio) %>%
  summarize(mean_salary = mean(salary_in_usd), .groups = "drop") %>%
  filter(country_full %in% country_names)

overall_median <- median(data$salary_in_usd) #calculate overall median

bar_plot <- ggplot(bar_data, aes(x = country_full, y = mean_salary, fill = remote_ratio)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  geom_hline(yintercept = overall_median, color = "black", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 1, y = overall_median + 5000, label = paste0("Median Salary: $", round(overall_median, 0)),
           color = "black", hjust = 0, size = 3.5) +
  labs(
    title = "Mean Salary by Country and Remote Work Level",
    subtitle = "Black dashed line indicates overall median salary across dataset",
    x = "Country",
    y = "Mean Salary (USD)",
    fill = "Remote Ratio (%)",
    caption = "Data source: Data Science Salaries 2025"
  ) +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
bar_plot

Specialized roles such as AI Architects and Machine Learning Engineers top the pay scale.
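
A ranking of job titles by median salary is one way to check which roles top the pay scale. The sketch below is a minimal version of that idea: `top_titles`, its `min_n` cutoff (used to skip titles with very few rows), and the `toy` rows are all assumptions made for illustration, not part of the original analysis.

```r
library(dplyr)

# Invented rows so the sketch is self-contained; not real results.
toy <- data.frame(
  job_title     = c("Title A", "Title A", "Title A",
                    "Title B", "Title B", "Title B",
                    "Title C"),
  salary_in_usd = c(100000, 110000, 120000,
                    150000, 160000, 170000,
                    900000)
)

# Hypothetical helper: the k job titles with the highest median salary,
# ignoring titles that appear fewer than min_n times.
top_titles <- function(df, min_n = 30, k = 10) {
  df %>%
    group_by(job_title) %>%
    summarise(
      n             = n(),
      median_salary = median(salary_in_usd),
      .groups       = "drop"
    ) %>%
    filter(n >= min_n) %>%
    arrange(desc(median_salary)) %>%
    slice_head(n = k)
}

# "Title C" has a single extreme row and is dropped by the min_n filter.
ranking <- top_titles(toy, min_n = 2, k = 2)
print(ranking)
```

On the full dataset the call would be `top_titles(data)`. The median is used rather than the mean so that a few extreme salaries do not dominate the ranking.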

Background Research

Part 2: Outside Research and Citations

According to the U.S. Bureau of Labor Statistics (2023), data science roles are projected to grow by 36% over the next decade. This growth is largely driven by the expansion of remote work opportunities and the increasing need for specialized skills in areas such as artificial intelligence and machine learning. As companies continue to invest in digital transformation and automation, roles that involve advanced data capabilities are becoming more critical—and more lucrative. These trends contribute to a competitive job market in which specialization and adaptability are key to higher compensation and career advancement.

Conclusion

This project analyzed how factors such as experience level, employment type, and job title influence salaries within the data science field. The findings show that experience plays a significant role in determining salary, with executive-level professionals earning far more than their entry-level counterparts. Full-time positions tend to offer the most financial stability, and roles like “AI Architect” and “Machine Learning Scientist” consistently rank among the highest-paying job titles. While the dataset provided valuable insights, one of the main challenges was the lack of geocoded location data, which limited the ability to explore global salary disparities through mapping. Future work could incorporate spatial data to reveal how location affects compensation in the data science industry.

💡 Challenges:

I had hoped to map salaries globally, but the location data was not geocoded. Future work could include mapping salaries by region.

Bibliography