Data Scientist Salaries Visualization

1. Introduction

Technology is developing rapidly, and many new professions have emerged as a result. One of the best known is the data scientist. A data scientist collects and cleans messy data, then analyzes large volumes of it to find insights, that is, a deeper understanding of an issue. For companies, these insights become strategic recommendations that company shareholders use to develop their business.

The data scientist is not the only profession in the data sector: there are also data engineers, data analysts, machine learning engineers, and several other related roles. The growing number of professions related to big data has made this field a very promising source of job opportunities, not only in Indonesia but throughout the world.

The salary of a data expert is also a point of interest for job seekers. Several factors strongly affect the amount. One of the main factors is work experience, from entry level to senior level. Another is the location of the job, because each country has its own salary standards. Educational background and certification in the data sector also affect the salary offered.

To this day, data experts remain in high demand among companies, possibly because relatively few people have developed expertise in the field of data. So, are you interested in learning to become a data expert?

2. Data Preparation

For this LBB project (Programming for Data Visualization), I used a dataset of salaries for data professions from 2020 to 2024 in several countries around the world. The dataset was obtained from https://www.kaggle.com.

The following table describes each column of the 2020-2024 salary dataset.

Column Name	Description
job_title	The job title or role associated with the reported salary.
experience_level	The level of experience of the individual.
employment_type	Indicates whether the employment is full-time, part-time, etc.
work_models	Describes different working models (remote, on-site, hybrid).
work_year	The specific year in which the salary information was recorded.
employee_residence	The residence location of the employee.
salary	The reported salary in the original currency.
salary_currency	The currency in which the salary is denominated.
salary_in_usd	The converted salary in US dollars.
company_location	The geographic location of the employing organization.
company_size	The size of the company, categorized by the number of employees.

a. Import & Read Data

The first step is to import the dataset using the read.csv() function.

data_salary <- read.csv("input_data/data_science_salaries.csv")
head(data_salary)

##        job_title experience_level employment_type work_models work_year
## 1  Data Engineer        Mid-level       Full-time      Remote      2024
## 2  Data Engineer        Mid-level       Full-time      Remote      2024
## 3 Data Scientist     Senior-level       Full-time      Remote      2024
## 4 Data Scientist     Senior-level       Full-time      Remote      2024
## 5   BI Developer        Mid-level       Full-time     On-site      2024
## 6   BI Developer        Mid-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 148100             USD        148100    United States
## 2      United States  98700             USD         98700    United States
## 3      United States 140032             USD        140032    United States
## 4      United States 100022             USD        100022    United States
## 5      United States 120000             USD        120000    United States
## 6      United States  62100             USD         62100    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium
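
As a minimal sketch of the import step, the block below writes a tiny made-up CSV to a temporary file, reads it with read.csv(), and checks that the columns needed later are present. The sample rows, the `required_cols` vector, and the names `sample_csv` and `sample_salary` are assumptions for illustration only; they stand in for the real file.

```r
# Hypothetical three-row sample written to a temporary file, standing in for
# input_data/data_science_salaries.csv.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "job_title,work_year,salary_in_usd",
  "Data Engineer,2024,148100",
  "Data Scientist,2024,140032",
  "BI Developer,2024,62100"
), sample_csv)

# Columns the later steps rely on (assumed subset, for illustration only).
required_cols <- c("job_title", "work_year", "salary_in_usd")

sample_salary <- read.csv(sample_csv)

# Stop early with a readable message if a required column is missing.
missing_cols <- setdiff(required_cols, names(sample_salary))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

nrow(sample_salary)
```

A check like this fails right at import time if the file layout changes, instead of later inside a summary step.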

b. Inspect Data

The next step is to inspect the imported data. To look at the first and last rows of the data_salary dataset, we use the head() and tail() functions.

head(data_salary)

##        job_title experience_level employment_type work_models work_year
## 1  Data Engineer        Mid-level       Full-time      Remote      2024
## 2  Data Engineer        Mid-level       Full-time      Remote      2024
## 3 Data Scientist     Senior-level       Full-time      Remote      2024
## 4 Data Scientist     Senior-level       Full-time      Remote      2024
## 5   BI Developer        Mid-level       Full-time     On-site      2024
## 6   BI Developer        Mid-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 148100             USD        148100    United States
## 2      United States  98700             USD         98700    United States
## 3      United States 140032             USD        140032    United States
## 4      United States 100022             USD        100022    United States
## 5      United States 120000             USD        120000    United States
## 6      United States  62100             USD         62100    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium

tail(data_salary)

##                     job_title experience_level employment_type work_models
## 6594 Principal Data Scientist     Senior-level       Full-time      Remote
## 6595       Staff Data Analyst      Entry-level        Contract      Hybrid
## 6596       Staff Data Analyst  Executive-level       Full-time     On-site
## 6597 Machine Learning Manager     Senior-level       Full-time      Hybrid
## 6598            Data Engineer        Mid-level       Full-time      Hybrid
## 6599           Data Scientist     Senior-level       Full-time     On-site
##      work_year employee_residence salary salary_currency salary_in_usd
## 6594      2020            Germany 130000             EUR        148261
## 6595      2020             Canada  60000             CAD         44753
## 6596      2020            Nigeria  15000             USD         15000
## 6597      2020             Canada 157000             CAD        117104
## 6598      2020            Austria  65000             EUR         74130
## 6599      2020            Austria  80000             EUR         91237
##      company_location company_size
## 6594          Germany       Medium
## 6595           Canada        Large
## 6596           Canada       Medium
## 6597           Canada        Large
## 6598          Austria        Large
## 6599          Austria        Small
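
Besides head() and tail(), it can help to check the size of the data and whether any values are missing. The sketch below uses a small made-up data frame (`toy_salary`, with one missing salary added on purpose), not the real dataset.

```r
# Hypothetical toy data with one missing salary, for illustration only.
toy_salary <- data.frame(
  job_title     = c("Data Engineer", "Data Scientist", "BI Developer"),
  salary_in_usd = c(148100, NA, 62100)
)

# Number of rows and columns.
dim(toy_salary)

# Count of missing values in each column.
na_per_column <- colSums(is.na(toy_salary))
na_per_column
```

If a column reports missing values, mean() returns NA for it unless na.rm = TRUE is set.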

c. Structure Data

To check whether each column has a suitable data type, we first look at the structure of the data with the glimpse() function from dplyr.

library(dplyr)

data_salary %>% 
  glimpse()

## Rows: 6,599
## Columns: 11
## $ job_title          <chr> "Data Engineer", "Data Engineer", "Data Scientist",…
## $ experience_level   <chr> "Mid-level", "Mid-level", "Senior-level", "Senior-l…
## $ employment_type    <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
## $ work_models        <chr> "Remote", "Remote", "Remote", "Remote", "On-site", …
## $ work_year          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <chr> "United States", "United States", "United States", …
## $ salary             <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency    <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd      <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location   <chr> "United States", "United States", "United States", …
## $ company_size       <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "…
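
The character columns in this output are the candidates for conversion to factors. As a small base R sketch, they can also be listed programmatically; `toy_salary` is a made-up data frame that only mimics a few of the real columns.

```r
# Hypothetical toy data mimicking three of the real columns.
toy_salary <- data.frame(
  job_title     = c("Data Engineer", "Data Scientist"),
  work_year     = c(2024L, 2024L),
  salary_in_usd = c(148100L, 140032L),
  stringsAsFactors = FALSE
)

# Class of every column, as a named character vector.
col_classes <- sapply(toy_salary, class)

# Names of the character columns, i.e. the candidates for factor conversion.
chr_cols <- names(col_classes)[col_classes == "character"]
chr_cols
```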

d. Data Cleansing

Before moving on, the following character columns should be converted to factors:

job_title,
experience_level,
employment_type,
work_models,
employee_residence,
salary_currency,
company_location,
company_size.

data_salary_clean <- 
  data_salary %>% 
  # Convert the categorical character columns to factors
  mutate(across(c(job_title, experience_level, employment_type, 
                  work_models, employee_residence, salary_currency,
                  company_location, company_size), 
                as.factor))

To double-check that the data types are now correct, we run glimpse() again on the cleaned data.

data_salary_clean %>% 
  glimpse()

## Rows: 6,599
## Columns: 11
## $ job_title          <fct> Data Engineer, Data Engineer, Data Scientist, Data …
## $ experience_level   <fct> Mid-level, Mid-level, Senior-level, Senior-level, M…
## $ employment_type    <fct> Full-time, Full-time, Full-time, Full-time, Full-ti…
## $ work_models        <fct> Remote, Remote, Remote, Remote, On-site, On-site, O…
## $ work_year          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <fct> United States, United States, United States, United…
## $ salary             <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency    <fct> USD, USD, USD, USD, USD, USD, USD, USD, USD, USD, U…
## $ salary_in_usd      <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location   <fct> United States, United States, United States, United…
## $ company_size       <fct> Medium, Medium, Medium, Medium, Medium, Medium, Med…

The data type of each column is correct, so the next step is to process this data.
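
One detail worth knowing: as.factor() sorts the levels alphabetically, so "Executive-level" comes before "Mid-level" in tables and legends. If a career order is preferred, the levels can be set explicitly. The sketch below is an optional variation, not part of the original cleaning step, and the `career_order` vector is an assumption about the intended ranking.

```r
# Hypothetical values, for illustration only.
experience <- c("Senior-level", "Entry-level", "Executive-level", "Mid-level")

# as.factor() sorts the levels alphabetically.
alphabetical_levels <- levels(as.factor(experience))
alphabetical_levels

# Assumed career order, lowest to highest.
career_order <- c("Entry-level", "Mid-level", "Senior-level", "Executive-level")
experience_ordered <- factor(experience, levels = career_order)
levels(experience_ordered)
```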

3. Data Processing

The next step prepares the data used for visualization. Each prepared dataset answers one business question.

a. Salary Trends in 2020-2024

A job seeker in the data field wants to know how the average salary changes from year to year across all countries. From the data, we can see whether the average salary shows a significant increase, a decrease, or no change.

The first step is to select only the required columns from the data_salary_clean dataset. Then calculate the average salary per year (2020-2024) and add a column to display the label on the line plot that will be created.

library(dplyr)
library(glue)
library(scales)
# Create average salary by work year

average_salary_year <- 
  data_salary_clean %>% 
    select(work_year, salary_in_usd) %>% 
    group_by(work_year) %>%
    summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE)) %>%
    mutate(label = glue("{comma(avg_salary)} USD" ))

average_salary_year

## # A tibble: 5 × 3
##   work_year avg_salary label      
##       <int>      <dbl> <glue>     
## 1      2020    102251. 102,251 USD
## 2      2021     99501. 99,501 USD 
## 3      2022    131789. 131,789 USD
## 4      2023    150791. 150,791 USD
## 5      2024    153124. 153,124 USD

The average_salary_year data is ready to be visualized with a line plot.
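
The label column above is built with glue() and comma(). As a dependency-free sketch, a similar label can be produced in base R; rounding to whole dollars first keeps every label at the same precision. The two averages below are made-up values, and formatC() is a swapped-in alternative, not the function used in this project.

```r
# Hypothetical averages, for illustration only.
avg_salary <- c(102251.2, 99500.8)

# Round to whole dollars, add thousands separators, then append the currency.
label <- paste(formatC(round(avg_salary), format = "d", big.mark = ","), "USD")
label
```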

b. Top 10 Job Titles with Highest Salaries

Jobs in the data field come with various job titles and different salary levels. To find the 10 highest-paid job titles, we select the required columns: job_title, experience_level, and salary_in_usd. We then calculate the average salary for each combination of job_title and experience_level.

# Prepare highest salary based on job_title & experience_level

salary_job <- 
  data_salary_clean %>% 
    select(job_title, experience_level, salary_in_usd) %>% 
    group_by(job_title, experience_level) %>% 
    summarise(avg_job = mean(salary_in_usd), .groups = 'drop') %>%
    mutate(label = glue("{comma(avg_job)} USD" ))
salary_job

## # A tibble: 279 × 4
##    job_title          experience_level avg_job label        
##    <fct>              <fct>              <dbl> <glue>       
##  1 AI Architect       Executive-level  215936  215,936.0 USD
##  2 AI Architect       Senior-level     233850  233,850.0 USD
##  3 AI Developer       Entry-level      110120. 110,119.5 USD
##  4 AI Developer       Mid-level        138294. 138,294.3 USD
##  5 AI Developer       Senior-level     162771. 162,770.7 USD
##  6 AI Engineer        Entry-level       28296. 28,296.5 USD 
##  7 AI Engineer        Mid-level        152988. 152,988.2 USD
##  8 AI Engineer        Senior-level     176706. 176,705.9 USD
##  9 AI Product Manager Senior-level     120000  120,000.0 USD
## 10 AI Programmer      Entry-level       56859. 56,858.8 USD 
## # ℹ 269 more rows

To find out the 10 positions that have the highest salary, we can use the order() function to sort the salaries from highest to lowest and select the top 10.

# to sort the salaries with the order() function
salary_job <- salary_job[order(salary_job$avg_job, decreasing = TRUE),]
  
top10_job <- head(salary_job,10)
top10_job

## # A tibble: 10 × 4
##    job_title                      experience_level avg_job label        
##    <fct>                          <fct>              <dbl> <glue>       
##  1 Principal Data Scientist       Executive-level  416000  416,000.0 USD
##  2 Analytics Engineering Manager  Senior-level     399880  399,880.0 USD
##  3 Data Science Tech Lead         Senior-level     375000  375,000.0 USD
##  4 Managing Director Data Science Executive-level  280000  280,000.0 USD
##  5 AWS Data Architect             Mid-level        258000  258,000.0 USD
##  6 Deep Learning Engineer         Senior-level     254706  254,706.0 USD
##  7 Cloud Data Architect           Senior-level     250000  250,000.0 USD
##  8 AI Architect                   Senior-level     233850  233,850.0 USD
##  9 Machine Learning Scientist     Mid-level        226131. 226,131.2 USD
## 10 Director of Data Science       Executive-level  225244. 225,244.3 USD

The top10_job dataset is ready to be visualized with a bar plot.
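
An average can rest on very few records, so it may be useful to keep the group size next to each mean before ranking. The sketch below shows the idea in base R on made-up data; the salaries, the `min_n` threshold, and the names `toy_jobs`, `job_summary` and `reliable_jobs` are all hypothetical and not taken from the dataset.

```r
# Hypothetical records, for illustration only.
toy_jobs <- data.frame(
  job_title     = c("Data Engineer", "Data Engineer", "Data Engineer",
                    "Principal Data Scientist",
                    "Data Analyst", "Data Analyst"),
  salary_in_usd = c(100000, 120000, 110000, 300000, 80000, 90000),
  stringsAsFactors = FALSE
)

# Mean salary and number of records per job title.
avg_by_job <- tapply(toy_jobs$salary_in_usd, toy_jobs$job_title, mean)
n_by_job   <- tapply(toy_jobs$salary_in_usd, toy_jobs$job_title, length)

job_summary <- data.frame(
  job_title = names(avg_by_job),
  avg_job   = as.vector(avg_by_job),
  n         = as.vector(n_by_job),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest average, as in the ranking step.
job_summary <- job_summary[order(job_summary$avg_job, decreasing = TRUE), ]

# Keep only groups backed by at least min_n records (hypothetical threshold).
min_n <- 2
reliable_jobs <- job_summary[job_summary$n >= min_n, ]
reliable_jobs
```

In this toy example the top-ranked title is based on a single record and drops out once the threshold is applied.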

c. The Highest Salaries from Various Company Locations

Every country has different salary standards for jobs in the field of data science, and some have high average salaries. To find the countries where companies pay the most, we start by selecting the columns to be used, company_location and salary_in_usd, and then calculate the average salary for each country.

grouped_loc <- 
  data_salary_clean %>%
    select(company_location, salary_in_usd) %>% 
    group_by(company_location) %>%
    summarise(avg_salary = mean(salary_in_usd)) %>% 
    mutate(label = glue("{comma(avg_salary)} USD" ))
grouped_loc

## # A tibble: 75 × 3
##    company_location       avg_salary label        
##    <fct>                       <dbl> <glue>       
##  1 Algeria                   100000  100,000.0 USD
##  2 Andorra                    50745  50,745.0 USD 
##  3 Argentina                  62000  62,000.0 USD 
##  4 Armenia                    50000  50,000.0 USD 
##  5 Australia                 114673. 114,673.4 USD
##  6 Austria                    71355. 71,354.8 USD 
##  7 Bahamas                    45555  45,555.0 USD 
##  8 Belgium                    76865. 76,864.8 USD 
##  9 Bosnia and Herzegovina    120000  120,000.0 USD
## 10 Brazil                     58569. 58,569.1 USD 
## # ℹ 65 more rows

Then we sort the locations by average salary and keep the 10 highest.

# ranking using order() function
grouped_loc <- grouped_loc[order(grouped_loc$avg_salary, decreasing = TRUE),]
  
top10_loc <- head(grouped_loc,10)
top10_loc

## # A tibble: 10 × 3
##    company_location       avg_salary label        
##    <fct>                       <dbl> <glue>       
##  1 Qatar                     300000  300,000.0 USD
##  2 Israel                    217332  217,332.0 USD
##  3 Puerto Rico               167500  167,500.0 USD
##  4 United States             157073. 157,073.1 USD
##  5 New Zealand               151634. 151,634.3 USD
##  6 Canada                    139833. 139,832.8 USD
##  7 Saudi Arabia              134999  134,999.0 USD
##  8 Ukraine                   121333. 121,333.3 USD
##  9 Bosnia and Herzegovina    120000  120,000.0 USD
## 10 Australia                 114673. 114,673.4 USD

The top10_loc dataset is ready to be visualized with a bar plot.

d. Salary Distribution by Experience Level & Company Size

We want to analyze the average salary paid to data scientists by company size and by work experience level. Does the salary differ between company sizes at the same experience level? For this analysis we need three columns: experience_level, company_size, and salary_in_usd. We calculate the average salary for each combination of experience_level and company_size.

exp_level <- 
  data_salary_clean %>% 
    select(experience_level, company_size, salary_in_usd) %>% 
    group_by(experience_level, company_size) %>%
    summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = 'drop') %>% 
    mutate(label = glue("{comma(avg_salary)} USD" ))
exp_level

## # A tibble: 12 × 4
##    experience_level company_size avg_salary label      
##    <fct>            <fct>             <dbl> <glue>     
##  1 Entry-level      Large            74603. 74,603 USD 
##  2 Entry-level      Medium           89223. 89,223 USD 
##  3 Entry-level      Small            68486. 68,486 USD 
##  4 Executive-level  Large           187994. 187,994 USD
##  5 Executive-level  Medium          190563. 190,563 USD
##  6 Executive-level  Small           169172. 169,172 USD
##  7 Mid-level        Large           100625. 100,625 USD
##  8 Mid-level        Medium          123304. 123,304 USD
##  9 Mid-level        Small            73770. 73,770 USD 
## 10 Senior-level     Large           150557. 150,557 USD
## 11 Senior-level     Medium          163532. 163,532 USD
## 12 Senior-level     Small           110274. 110,274 USD

The exp_level dataset is ready to be visualized with a multivariate plot.
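
To compare company sizes at the same experience level, the summary can also be cross-tabulated, with one row per experience level and one column per company size. The sketch below re-enters the rounded averages from the printed table above into a new data frame (`exp_level_printed`, a name introduced here) and uses base R xtabs(); it is an illustration, not part of the original pipeline.

```r
# Rounded averages copied from the printed exp_level output above.
exp_level_printed <- data.frame(
  experience_level = rep(c("Entry-level", "Executive-level",
                           "Mid-level", "Senior-level"), each = 3),
  company_size     = rep(c("Large", "Medium", "Small"), times = 4),
  avg_salary       = c( 74603,  89223,  68486,
                       187994, 190563, 169172,
                       100625, 123304,  73770,
                       150557, 163532, 110274)
)

# One row per experience level, one column per company size.
# Each cell holds a single value, so the sum computed by xtabs() is that value.
salary_wide <- xtabs(avg_salary ~ experience_level + company_size,
                     data = exp_level_printed)
salary_wide

# Gap between medium and large companies at each experience level.
medium_minus_large <- salary_wide[, "Medium"] - salary_wide[, "Large"]
medium_minus_large
```

A positive gap in every row means medium companies pay more on average than large ones at that experience level.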

4. Data Visualization

a. Salary Trends Over Time (2020-2024) Line Plot

We want to see the trend of the average salary of a data scientist per year from 2020 to 2024. A line plot shows how average salaries have changed over the years, which helps identify trends and fluctuations. We put work_year on the x-axis and the average salary on the y-axis.

library(ggplot2)
# Create a line plot Salary Trends Over Time (by Work Year)

plot1 <- 
ggplot(average_salary_year, 
       aes(x = work_year, 
           y = avg_salary)) +
  
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(color = "red", linewidth = 1) +
  
  geom_point(color = "black", size = 3) +
  
  labs(title = "Average Salary Trends in 2020-2024",
       x = "Year",
       y = "Average Salary in USD")+
  
  theme_minimal() +
  
  theme(legend.position = "none")+
  
  geom_text(aes(label = label), size = 3, nudge_x = 0.3 )+
  
  scale_y_continuous(labels = comma)  

plot1

Insight:

The average salary in 2020 was 102,251 USD; in 2021 it decreased slightly to 99,501 USD.
From 2021 to 2022 the average salary rose sharply to 131,789 USD, an increase of about 32%.
Apart from the dip in 2021, the average salary rose every year and reached 153,124 USD in 2024.
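
The year-over-year changes behind these statements can be worked out directly from the averages. The sketch below re-enters the rounded values from the average_salary_year output and computes the absolute and percentage change with base R; the vector names are introduced here for illustration.

```r
# Rounded yearly averages copied from the average_salary_year output.
avg_salary <- c("2020" = 102251, "2021" = 99501, "2022" = 131789,
                "2023" = 150791, "2024" = 153124)

# Absolute change versus the previous year.
abs_change <- diff(avg_salary)
abs_change

# Percentage change versus the previous year, rounded to one decimal.
pct_change <- round(100 * abs_change / head(avg_salary, -1), 1)
pct_change
```

The largest jump is from 2021 to 2022, and growth slows considerably from 2023 to 2024.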

b. Top 10 Job Titles with Highest Salaries 2020-2024

We want to analyze the highest average salaries by job title and experience level. Do the highest average salaries go only to employees with a high level of experience? To visualize this, I used a bar plot that ranks the highest average salaries by job title and experience level. We put avg_job on the x-axis and job_title on the y-axis.

plot2 <- 
  ggplot(data = top10_job,
         mapping = aes(x = avg_job,
                       y = reorder(job_title,avg_job),
                       fill = experience_level))+
  
  geom_col(aes(fill=experience_level))+

  labs(title = "Highest Salaries 2020-2024", 
       subtitle = "Top 10 Job Title with Highest Salaries", 
       x = "Salary in USD",
       y = "Job Title",
       fill = "Experience Level") +
  
  theme_minimal() +
  
  theme(legend.position = "right")+
  
  geom_text(aes(label = label), size = 2.5, nudge_x = 10000)+
  
  scale_x_continuous(labels = comma)                     

plot2

Insight:

The highest average salary is 416,000 USD, for Principal Data Scientists at the executive level, the highest experience level.
Two of the 10 highest-paid positions, AWS Data Architect and Machine Learning Scientist, are at the mid level, so high salaries are not limited to the most experienced employees.

c. Top 10 Highest Salaries by Company Location 2020-2024

We want to find the countries where companies pay the highest salaries, narrowed down to the 10 company locations with the highest average salary. To visualize this, I used a bar plot, which makes the 10 countries easy to rank. We put avg_salary on the x-axis and company_location on the y-axis.

plot3 <- 
  ggplot(top10_loc, aes(x = avg_salary,
                        y = reorder(company_location, avg_salary),
                        fill = avg_salary)) +
  
  geom_bar(stat = "identity") +
  
  scale_fill_gradient(low = "red", high = "green")+  
  
  labs(title = "Highest Salaries 2020-2024", 
       subtitle = "Top 10 Company Location", 
       x = "Salary in USD",
       y = "Company Location") +
  
  theme_minimal() +
  
  theme(legend.position = "none")+
  
  geom_text(aes(label = label), size = 2.5, nudge_x = 25000)+
  
  scale_x_continuous(labels = comma)

plot3

Insight:

Qatar has the highest average salary, 300,000 USD.
Israel ranks second with an average salary of 217,332 USD.
The gap between Qatar in 1st place and Australia in 10th place is large: 185,326.6 USD.
The gaps at the top are also notable: 49,832 USD between Puerto Rico (167,500 USD) and Israel (217,332 USD), and 82,668 USD between Israel and Qatar (300,000 USD).
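
The gaps quoted above are plain subtractions of the averages in top10_loc. As a quick check, the sketch below re-enters the four relevant values from the printed table; the vector `top10_avg` and the gap variable names are introduced here for illustration.

```r
# Averages copied from the printed top10_loc output.
top10_avg <- c(Qatar = 300000, Israel = 217332,
               `Puerto Rico` = 167500, Australia = 114673.4)

# Gap between 1st place (Qatar) and 10th place (Australia).
gap_first_to_tenth <- top10_avg[["Qatar"]] - top10_avg[["Australia"]]

# Gaps between neighbouring ranks at the top of the list.
gap_qatar_israel  <- top10_avg[["Qatar"]] - top10_avg[["Israel"]]
gap_israel_puerto <- top10_avg[["Israel"]] - top10_avg[["Puerto Rico"]]

c(first_to_tenth = gap_first_to_tenth,
  qatar_israel   = gap_qatar_israel,
  israel_puerto  = gap_israel_puerto)
```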

d. Salary Distribution by Experience Level and Company Size in 2020-2024

We want to analyze the average salary paid to data scientists by company size and by work experience level. Does the salary differ between company sizes at the same experience level? To visualize the average salary distribution, I used a multivariate plot that groups avg_salary by experience_level and company_size.

# Create barplot for salary distribution by experience level & company size
plot4 <- 
  ggplot(exp_level, aes(x = company_size, y = avg_salary, fill = experience_level)) +
  
  geom_bar(stat = "identity", position = "dodge") +

  labs(x = "Company Size",
       y = "Salary (USD)",
       title = "Salary Distribution by Company Size and Experience Level in 2020-2024",
       fill = "Experience Level") +

  theme_minimal()+

  scale_y_continuous(labels = comma)+
  
  theme(legend.position = "right")

plot4

Insight:

At every experience level, the average salary for data scientists differs by company size.
Medium-sized companies pay higher average salaries than large companies at every experience level.
At the executive level, the highest experience level, the average salary is 190,563 USD in medium companies versus 187,994 USD in large companies. The same pattern holds at the entry, mid, and senior levels.
In this dataset, a larger company therefore does not mean a higher average salary.

Intan M Sari

2024-07-01
