1. Introduction

Technology is developing rapidly, and many new professions have emerged as a result. One of the best known is the data scientist. A data scientist collects and cleans messy data, then analyzes large volumes of different kinds of data to find insights, that is, a better understanding of an issue. For companies, these insights become strategic recommendations that company shareholders use to develop the business.

The data scientist is not the only profession in the data sector: there are also data engineers, data analysts, machine learning engineers, and several related roles. The number of new professions tied to big data makes this field a very promising source of job opportunities, not only in Indonesia but throughout the world.

The salary of a data expert is also a draw for job seekers. Several factors strongly affect its amount. One of the main factors is work experience (from entry level to senior level). Another is the location of the job, because each country has its own salary standards. Educational background and certification in the data sector also affect the salary offered.

Data experts remain in high demand among companies, possibly because relatively few people have developed expertise in the field. So, are you interested in learning to be a data expert?

2. Data Preparation

For this LBB (Programming for Data Visualization) project, I used data on the salaries of data professionals in several countries from 2020 to 2024. The dataset was obtained from https://www.kaggle.com.

The following table describes each column of the 2020-2024 data science salary dataset.

Column Name          Description
job_title            The job title or role associated with the reported salary.
experience_level     The level of experience of the individual.
employment_type      Indicates whether the employment is full-time, part-time, etc.
work_models          Describes different working models (remote, on-site, hybrid).
work_year            The specific year in which the salary information was recorded.
employee_residence   The residence location of the employee.
salary               The reported salary in the original currency.
salary_currency      The currency in which the salary is denominated.
salary_in_usd        The converted salary in US dollars.
company_location     The geographic location of the employing organization.
company_size         The size of the company, categorized by the number of employees.

a. Import & Read Data

The first step is to import the dataset using the read.csv() function.

data_salary <- read.csv("input_data/data_science_salaries.csv")
head(data_salary)
##        job_title experience_level employment_type work_models work_year
## 1  Data Engineer        Mid-level       Full-time      Remote      2024
## 2  Data Engineer        Mid-level       Full-time      Remote      2024
## 3 Data Scientist     Senior-level       Full-time      Remote      2024
## 4 Data Scientist     Senior-level       Full-time      Remote      2024
## 5   BI Developer        Mid-level       Full-time     On-site      2024
## 6   BI Developer        Mid-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 148100             USD        148100    United States
## 2      United States  98700             USD         98700    United States
## 3      United States 140032             USD        140032    United States
## 4      United States 100022             USD        100022    United States
## 5      United States 120000             USD        120000    United States
## 6      United States  62100             USD         62100    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium

b. Inspect Data

The next step is to inspect the imported dataset. To look at the first and last rows of data_salary, we use the head() and tail() functions.

head(data_salary)
##        job_title experience_level employment_type work_models work_year
## 1  Data Engineer        Mid-level       Full-time      Remote      2024
## 2  Data Engineer        Mid-level       Full-time      Remote      2024
## 3 Data Scientist     Senior-level       Full-time      Remote      2024
## 4 Data Scientist     Senior-level       Full-time      Remote      2024
## 5   BI Developer        Mid-level       Full-time     On-site      2024
## 6   BI Developer        Mid-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 148100             USD        148100    United States
## 2      United States  98700             USD         98700    United States
## 3      United States 140032             USD        140032    United States
## 4      United States 100022             USD        100022    United States
## 5      United States 120000             USD        120000    United States
## 6      United States  62100             USD         62100    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium
tail(data_salary)
##                     job_title experience_level employment_type work_models
## 6594 Principal Data Scientist     Senior-level       Full-time      Remote
## 6595       Staff Data Analyst      Entry-level        Contract      Hybrid
## 6596       Staff Data Analyst  Executive-level       Full-time     On-site
## 6597 Machine Learning Manager     Senior-level       Full-time      Hybrid
## 6598            Data Engineer        Mid-level       Full-time      Hybrid
## 6599           Data Scientist     Senior-level       Full-time     On-site
##      work_year employee_residence salary salary_currency salary_in_usd
## 6594      2020            Germany 130000             EUR        148261
## 6595      2020             Canada  60000             CAD         44753
## 6596      2020            Nigeria  15000             USD         15000
## 6597      2020             Canada 157000             CAD        117104
## 6598      2020            Austria  65000             EUR         74130
## 6599      2020            Austria  80000             EUR         91237
##      company_location company_size
## 6594          Germany       Medium
## 6595           Canada        Large
## 6596           Canada       Medium
## 6597           Canada        Large
## 6598          Austria        Large
## 6599          Austria        Small

c. Structure Data

To check whether each column has a suitable data type, we first inspect the dataset with the glimpse() function.

library(dplyr)

data_salary %>% 
  glimpse()
## Rows: 6,599
## Columns: 11
## $ job_title          <chr> "Data Engineer", "Data Engineer", "Data Scientist",…
## $ experience_level   <chr> "Mid-level", "Mid-level", "Senior-level", "Senior-l…
## $ employment_type    <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
## $ work_models        <chr> "Remote", "Remote", "Remote", "Remote", "On-site", …
## $ work_year          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <chr> "United States", "United States", "United States", …
## $ salary             <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency    <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd      <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location   <chr> "United States", "United States", "United States", …
## $ company_size       <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "…

d. Data Cleansing

Before the next step, the following columns must be converted to the factor type:

  1. job_title,
  2. experience_level,
  3. employment_type,
  4. work_models,
  5. employee_residence,
  6. salary_currency,
  7. company_location,
  8. company_size.

data_salary_clean <- 
  data_salary %>% 
    # Convert every categorical column to a factor in one step
    mutate(across(c(job_title, experience_level, employment_type,
                    work_models, employee_residence, salary_currency,
                    company_location, company_size),
                  as.factor))

To double-check that the data types are correct, we can run glimpse() again on the cleaned dataset.

data_salary_clean %>% 
  glimpse()
## Rows: 6,599
## Columns: 11
## $ job_title          <fct> Data Engineer, Data Engineer, Data Scientist, Data …
## $ experience_level   <fct> Mid-level, Mid-level, Senior-level, Senior-level, M…
## $ employment_type    <fct> Full-time, Full-time, Full-time, Full-time, Full-ti…
## $ work_models        <fct> Remote, Remote, Remote, Remote, On-site, On-site, O…
## $ work_year          <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <fct> United States, United States, United States, United…
## $ salary             <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency    <fct> USD, USD, USD, USD, USD, USD, USD, USD, USD, USD, U…
## $ salary_in_usd      <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location   <fct> United States, United States, United States, United…
## $ company_size       <fct> Medium, Medium, Medium, Medium, Medium, Medium, Med…

The data type of each column is correct, so the next step is to process this data.
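
Beyond type conversion, a cleansing step often also checks for missing values and repeated rows. The following is a minimal sketch of such a check, assuming a toy data frame (toy_salary, with made-up rows) in place of data_salary_clean. Note that in salary survey data, identical rows are not necessarily errors, since two people can report the same job and salary.

```r
# Toy data frame standing in for data_salary_clean (illustrative values only)
toy_salary <- data.frame(
  job_title     = factor(c("Data Engineer", "Data Engineer", "Data Scientist")),
  salary_in_usd = c(148100L, 148100L, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_salary))
na_per_column

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy_salary))
n_duplicates
```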

3. Data Processing

The next step prepares the data that will be used for visualization. Each dataset is prepared to answer one of the business questions.

b. Top 10 Job Title with Highest Salaries

Jobs in the data field come with many job titles and different salary levels. To find the 10 highest-paid job titles, we select the required columns: job_title, experience_level, and salary_in_usd. We then calculate the average salary for each combination of job_title and experience_level.

# Packages for the label column: glue() builds the text, comma() adds thousands separators
library(glue)
library(scales)

# Prepare average salary by job_title & experience_level
salary_job <- 
  data_salary_clean %>% 
    select(job_title, experience_level, salary_in_usd) %>% 
    group_by(job_title, experience_level) %>% 
    summarise(avg_job = mean(salary_in_usd), .groups = 'drop') %>%
    mutate(label = glue("{comma(avg_job)} USD"))
salary_job
## # A tibble: 279 × 4
##    job_title          experience_level avg_job label        
##    <fct>              <fct>              <dbl> <glue>       
##  1 AI Architect       Executive-level  215936  215,936.0 USD
##  2 AI Architect       Senior-level     233850  233,850.0 USD
##  3 AI Developer       Entry-level      110120. 110,119.5 USD
##  4 AI Developer       Mid-level        138294. 138,294.3 USD
##  5 AI Developer       Senior-level     162771. 162,770.7 USD
##  6 AI Engineer        Entry-level       28296. 28,296.5 USD 
##  7 AI Engineer        Mid-level        152988. 152,988.2 USD
##  8 AI Engineer        Senior-level     176706. 176,705.9 USD
##  9 AI Product Manager Senior-level     120000  120,000.0 USD
## 10 AI Programmer      Entry-level       56859. 56,858.8 USD 
## # ℹ 269 more rows

To find out the 10 positions that have the highest salary, we can use the order() function to sort the salaries from highest to lowest and select the top 10.

# Sort the average salaries from highest to lowest with the order() function
salary_job <- salary_job[order(salary_job$avg_job, decreasing = TRUE),]
  
top10_job <- head(salary_job,10)
top10_job
## # A tibble: 10 × 4
##    job_title                      experience_level avg_job label        
##    <fct>                          <fct>              <dbl> <glue>       
##  1 Principal Data Scientist       Executive-level  416000  416,000.0 USD
##  2 Analytics Engineering Manager  Senior-level     399880  399,880.0 USD
##  3 Data Science Tech Lead         Senior-level     375000  375,000.0 USD
##  4 Managing Director Data Science Executive-level  280000  280,000.0 USD
##  5 AWS Data Architect             Mid-level        258000  258,000.0 USD
##  6 Deep Learning Engineer         Senior-level     254706  254,706.0 USD
##  7 Cloud Data Architect           Senior-level     250000  250,000.0 USD
##  8 AI Architect                   Senior-level     233850  233,850.0 USD
##  9 Machine Learning Scientist     Mid-level        226131. 226,131.2 USD
## 10 Director of Data Science       Executive-level  225244. 225,244.3 USD

The top10_job dataset is ready to be visualized using a bar plot.
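
The sort-then-head() step can also be written in one call with dplyr's slice_max(). This is a sketch on a toy tibble (toy_job, with placeholder job titles and made-up averages), not a replacement for the code above:

```r
library(dplyr)

# Toy summary table standing in for salary_job (illustrative values only)
toy_job <- tibble(
  job_title = c("Title A", "Title B", "Title C", "Title D"),
  avg_job   = c(120000, 416000, 90000, 250000)
)

# Keep the n rows with the largest avg_job, returned from highest to lowest
top2_job <- toy_job %>% 
  slice_max(avg_job, n = 2)
top2_job
```

By default slice_max() keeps ties, so it can return more than n rows; passing with_ties = FALSE caps the result at exactly n.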

c. The Highest Salaries from Various Company Locations.

Every country has different salary standards for jobs in the field of data science, and some countries have high average salaries. To find the countries that hire at high salary standards, we start by preparing the data. The first step is to select the columns to be used, company_location and salary_in_usd, and then calculate the average salary for each country.

grouped_loc <- 
  data_salary_clean %>%
    select(company_location, salary_in_usd) %>% 
    group_by(company_location) %>%
    summarise(avg_salary = mean(salary_in_usd)) %>% 
    mutate(label = glue("{comma(avg_salary)} USD" ))
grouped_loc
## # A tibble: 75 × 3
##    company_location       avg_salary label        
##    <fct>                       <dbl> <glue>       
##  1 Algeria                   100000  100,000.0 USD
##  2 Andorra                    50745  50,745.0 USD 
##  3 Argentina                  62000  62,000.0 USD 
##  4 Armenia                    50000  50,000.0 USD 
##  5 Australia                 114673. 114,673.4 USD
##  6 Austria                    71355. 71,354.8 USD 
##  7 Bahamas                    45555  45,555.0 USD 
##  8 Belgium                    76865. 76,864.8 USD 
##  9 Bosnia and Herzegovina    120000  120,000.0 USD
## 10 Brazil                     58569. 58,569.1 USD 
## # ℹ 65 more rows

Then we rank the averages and keep the 10 highest values.

# Rank from highest to lowest using the order() function
grouped_loc <- grouped_loc[order(grouped_loc$avg_salary, decreasing = TRUE),]
  
top10_loc <- head(grouped_loc,10)
top10_loc
## # A tibble: 10 × 3
##    company_location       avg_salary label        
##    <fct>                       <dbl> <glue>       
##  1 Qatar                     300000  300,000.0 USD
##  2 Israel                    217332  217,332.0 USD
##  3 Puerto Rico               167500  167,500.0 USD
##  4 United States             157073. 157,073.1 USD
##  5 New Zealand               151634. 151,634.3 USD
##  6 Canada                    139833. 139,832.8 USD
##  7 Saudi Arabia              134999  134,999.0 USD
##  8 Ukraine                   121333. 121,333.3 USD
##  9 Bosnia and Herzegovina    120000  120,000.0 USD
## 10 Australia                 114673. 114,673.4 USD

The top10_loc dataset is ready to be visualized using a bar plot.
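
An average says little about how many records stand behind it, so it may help to report the group size next to each mean. The sketch below uses a toy tibble (toy_loc, with hypothetical countries and made-up salaries) and adds an n_records column with n():

```r
library(dplyr)

# Toy data standing in for data_salary_clean (hypothetical countries and values)
toy_loc <- tibble(
  company_location = c("Country X", "Country Y", "Country Y", "Country Y"),
  salary_in_usd    = c(300000, 150000, 160000, 170000)
)

# Average salary and number of records per location, highest average first
loc_summary <- toy_loc %>% 
  group_by(company_location) %>% 
  summarise(avg_salary = mean(salary_in_usd),
            n_records  = n()) %>% 
  arrange(desc(avg_salary))
loc_summary
```

In this toy example the top-ranked location rests on a single record, which is the kind of case the extra column makes visible.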

d. Salary Distribution by Experience Level & Company Size.

We want to analyze the average salary paid to data scientists based on company size and work experience level. Does the salary differ for the same experience level at companies of different sizes? The analysis needs 3 columns: experience_level, company_size, and salary_in_usd. We then calculate the average salary for each combination of experience_level and company_size.

exp_level <- 
  data_salary_clean %>% 
    select(experience_level, company_size, salary_in_usd) %>% 
    group_by(experience_level, company_size) %>%
    summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = 'drop') %>% 
    mutate(label = glue("{comma(avg_salary)} USD" ))
exp_level
## # A tibble: 12 × 4
##    experience_level company_size avg_salary label      
##    <fct>            <fct>             <dbl> <glue>     
##  1 Entry-level      Large            74603. 74,603 USD 
##  2 Entry-level      Medium           89223. 89,223 USD 
##  3 Entry-level      Small            68486. 68,486 USD 
##  4 Executive-level  Large           187994. 187,994 USD
##  5 Executive-level  Medium          190563. 190,563 USD
##  6 Executive-level  Small           169172. 169,172 USD
##  7 Mid-level        Large           100625. 100,625 USD
##  8 Mid-level        Medium          123304. 123,304 USD
##  9 Mid-level        Small            73770. 73,770 USD 
## 10 Senior-level     Large           150557. 150,557 USD
## 11 Senior-level     Medium          163532. 163,532 USD
## 12 Senior-level     Small           110274. 110,274 USD

The exp_level dataset is ready to be visualized using a multivariate plot.
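
The table above lists experience levels alphabetically because factor levels default to sorted order. If a career-style order is preferred in tables and plot legends, the levels can be set explicitly. The order used below (entry, mid, senior, executive) is an assumption about seniority, not something stated in the dataset:

```r
# Assumed seniority order of the four experience levels
experience_order <- c("Entry-level", "Mid-level", "Senior-level", "Executive-level")

# Toy vector standing in for the experience_level column
toy_levels <- c("Senior-level", "Entry-level", "Executive-level", "Mid-level")

# Default factor: levels are sorted alphabetically
default_factor <- factor(toy_levels)
levels(default_factor)

# Explicit levels: grouping and legends follow the given order
ordered_factor <- factor(toy_levels, levels = experience_order)
levels(ordered_factor)
```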

4. Data Visualization

b. Top 10 Job Title with Highest Salaries 2020-2024

We want to analyze the highest average salaries based on the job title and experience level of data experts. Do the highest average salaries go only to employees with a high level of experience? To visualize this analysis, I used a bar plot that ranks the highest average salaries by job title and experience level. We use avg_job on the x-axis and job_title on the y-axis.

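library(ggplot2)
library(scales)  # provides comma() for the axis labels

plot2 <- 
  ggplot(data = top10_job,
         mapping = aes(x = avg_job,
                       y = reorder(job_title, avg_job),
                       fill = experience_level)) +
  
  # fill is inherited from ggplot(), so geom_col() needs no extra mapping
  geom_col() +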

  labs(title = "Highest Salaries 2020-2024", 
       subtitle = "Top 10 Job Title with Highest Salaries", 
       x = "Salary in USD",
       y = "Job Title",
       fill = "Experience Level") +
  
  theme_minimal() +
  
  theme(legend.position = "right")+
  
  geom_text(aes(label = label), size = 2.5, nudge_x = 10000)+
  
  scale_x_continuous(labels = comma)                     

plot2

Insight:

  • The highest average salary is 416,000 USD, for Principal Data Scientists at the executive level, which is also the highest experience level.

  • Among the 10 highest averages, 2 positions are mid-level: AWS Data Architect and Machine Learning Scientist.

c. Top 10 Highest Salaries by Company Location 2020-2024

We want to find the countries that hire at high salary standards, narrowed down to the 10 company locations with the highest average salary. To visualize this, I used a bar plot, which makes it easy to rank the 10 countries. We use avg_salary on the x-axis and company_location on the y-axis.

plot3 <- 
  ggplot(top10_loc, aes(x = avg_salary,
                        y = reorder(company_location, avg_salary),
                        fill = avg_salary)) +
  
  geom_bar(stat = "identity") +
  
  scale_fill_gradient(low = "red", high = "green")+  
  
  labs(title = "Highest Salaries 2020-2024", 
       subtitle = "Top 10 Company Location", 
       x = "Salary in USD",
       y = "Company Location") +
  
  theme_minimal() +
  
  theme(legend.position = "none")+
  
  geom_text(aes(label = label), size = 2.5, nudge_x = 25000)+
  
  scale_x_continuous(labels = comma)

plot3

Insight:

  • Qatar has the highest average salary, at 300,000.0 USD.

  • Israel has the second-highest average salary, at 217,332.0 USD.

  • The gap between Qatar (1st) and Australia (10th) is large: 185,326.6 USD.

  • There are also notable gaps near the top: 49,832 USD between Puerto Rico (167,500.0 USD) and Israel (217,332.0 USD), and 82,668 USD between Israel (217,332.0 USD) and Qatar (300,000.0 USD).
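
The gaps quoted in these insights can be reproduced directly from the averages printed in top10_loc. The values below are copied from that output; the variable names are only illustrative:

```r
# Average salaries (USD) copied from the top10_loc output
avg_salary <- c(Qatar = 300000, Israel = 217332,
                `Puerto Rico` = 167500, Australia = 114673.4)

# Gap between the 1st and the 10th company location
gap_qatar_australia <- avg_salary[["Qatar"]] - avg_salary[["Australia"]]

# Gaps between neighbouring positions at the top of the ranking
gap_israel_puerto_rico <- avg_salary[["Israel"]] - avg_salary[["Puerto Rico"]]
gap_qatar_israel       <- avg_salary[["Qatar"]] - avg_salary[["Israel"]]

c(gap_qatar_australia, gap_israel_puerto_rico, gap_qatar_israel)
```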

d. Salary Distribution by Experience Level and Company Size in 2020-2024

We want to analyze the average salary paid to data scientists based on company size and work experience level. Does the salary differ for the same experience level at companies of different sizes? To visualize the average salary distribution, I used a multivariate (grouped) bar plot that shows avg_salary by company_size and experience_level.

# Create barplot for salary distribution by experience level & company size
plot4 <- 
  ggplot(exp_level, aes(x = company_size, y = avg_salary, fill = experience_level)) +
  
  geom_bar(stat = "identity", position = "dodge") +

  labs(x = "Company Size",
       y = "Salary (USD)",
       title = "Salaries Distribution by Company Size and Experience Level in 2020-2024",
       fill = "Experience Level") +

  theme_minimal()+

  scale_y_continuous(labels = comma)+
  
  theme(legend.position = "right")

plot4

Insight:

  • The average salary at each experience level differs by company size.

  • Medium-sized companies pay higher average salaries than large companies at every experience level.

  • At the executive level, the highest experience level, the average salary is 190,563 USD in medium companies versus 187,994 USD in large companies. The same pattern holds for entry-level, mid-level, and senior-level.

  • In this dataset, then, a high average salary does not depend on working for a large company.
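
The medium-versus-large comparison can be checked with a few lines of base R. The averages below are copied (rounded to whole USD) from the exp_level output; the data frame and column names are illustrative:

```r
# Average salaries (USD) by company size, copied from the exp_level output
avg_by_size <- data.frame(
  experience_level = c("Entry-level", "Mid-level", "Senior-level", "Executive-level"),
  medium = c(89223, 123304, 163532, 190563),
  large  = c(74603, 100625, 150557, 187994)
)

# Positive values mean medium companies pay more on average
avg_by_size$gap_medium_minus_large <- avg_by_size$medium - avg_by_size$large
avg_by_size

# TRUE when medium exceeds large at every experience level
medium_always_higher <- all(avg_by_size$gap_medium_minus_large > 0)
medium_always_higher
```

The gap is widest at mid-level and narrowest at executive-level, where the two averages differ by under 3,000 USD.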