Technology is developing rapidly, and many new professions have emerged as a result. One of the best known is the data scientist. A data scientist collects and cleans messy data, then analyzes large volumes of it to find insights, meaning a clearer understanding of an issue. For companies, these insights become strategic recommendations that company shareholders use to develop the business.
The data scientist is not the only profession in the data sector: there are also data engineers, data analysts, machine learning engineers, and several related roles. The growth of professions around big data has made the field a very promising source of job opportunities, not only in Indonesia but throughout the world.
The salary of a data expert is also a draw for job seekers. Several factors strongly affect the salary amount. One of the main factors is work experience (entry level to senior level). Another is the location of the job, because each country has its own salary standards. Educational background and certification in the data sector also affect the salary offered.
Data experts remain in high demand among companies, possibly because few people have developed expertise in the field. So, are you interested in learning to become a data expert?
For this LBB project (Programming for Data Visualization), I used salary data for data professions from 2020 to 2024 in several countries. The dataset was obtained from https://www.kaggle.com.
The following table describes each column of the 2020-2024 data science salary dataset.
Column Name | Description |
---|---|
job_title | The job title or role associated with the reported salary. |
experience_level | The level of experience of the individual. |
employment_type | Indicates whether the employment is full-time, part-time, etc. |
work_models | Describes different working models (remote, on-site, hybrid). |
work_year | The specific year in which the salary information was recorded. |
employee_residence | The residence location of the employee. |
salary | The reported salary in the original currency. |
salary_currency | The currency in which the salary is denominated. |
salary_in_usd | The converted salary in US dollars. |
company_location | The geographic location of the employing organization. |
company_size | The size of the company, categorized by the number of employees. |
The first step is to import the dataset using the read.csv() function.
data_salary <- read.csv("input_data/data_science_salaries.csv")
head(data_salary)
## job_title experience_level employment_type work_models work_year
## 1 Data Engineer Mid-level Full-time Remote 2024
## 2 Data Engineer Mid-level Full-time Remote 2024
## 3 Data Scientist Senior-level Full-time Remote 2024
## 4 Data Scientist Senior-level Full-time Remote 2024
## 5 BI Developer Mid-level Full-time On-site 2024
## 6 BI Developer Mid-level Full-time On-site 2024
## employee_residence salary salary_currency salary_in_usd company_location
## 1 United States 148100 USD 148100 United States
## 2 United States 98700 USD 98700 United States
## 3 United States 140032 USD 140032 United States
## 4 United States 100022 USD 100022 United States
## 5 United States 120000 USD 120000 United States
## 6 United States 62100 USD 62100 United States
## company_size
## 1 Medium
## 2 Medium
## 3 Medium
## 4 Medium
## 5 Medium
## 6 Medium
The next step is to inspect the imported dataset. To see the first and last rows of the data_salary dataset, we use the head() and tail() functions.
head(data_salary)
## job_title experience_level employment_type work_models work_year
## 1 Data Engineer Mid-level Full-time Remote 2024
## 2 Data Engineer Mid-level Full-time Remote 2024
## 3 Data Scientist Senior-level Full-time Remote 2024
## 4 Data Scientist Senior-level Full-time Remote 2024
## 5 BI Developer Mid-level Full-time On-site 2024
## 6 BI Developer Mid-level Full-time On-site 2024
## employee_residence salary salary_currency salary_in_usd company_location
## 1 United States 148100 USD 148100 United States
## 2 United States 98700 USD 98700 United States
## 3 United States 140032 USD 140032 United States
## 4 United States 100022 USD 100022 United States
## 5 United States 120000 USD 120000 United States
## 6 United States 62100 USD 62100 United States
## company_size
## 1 Medium
## 2 Medium
## 3 Medium
## 4 Medium
## 5 Medium
## 6 Medium
tail(data_salary)
## job_title experience_level employment_type work_models
## 6594 Principal Data Scientist Senior-level Full-time Remote
## 6595 Staff Data Analyst Entry-level Contract Hybrid
## 6596 Staff Data Analyst Executive-level Full-time On-site
## 6597 Machine Learning Manager Senior-level Full-time Hybrid
## 6598 Data Engineer Mid-level Full-time Hybrid
## 6599 Data Scientist Senior-level Full-time On-site
## work_year employee_residence salary salary_currency salary_in_usd
## 6594 2020 Germany 130000 EUR 148261
## 6595 2020 Canada 60000 CAD 44753
## 6596 2020 Nigeria 15000 USD 15000
## 6597 2020 Canada 157000 CAD 117104
## 6598 2020 Austria 65000 EUR 74130
## 6599 2020 Austria 80000 EUR 91237
## company_location company_size
## 6594 Germany Medium
## 6595 Canada Large
## 6596 Canada Medium
## 6597 Canada Large
## 6598 Austria Large
## 6599 Austria Small
To check the data type of each column, we use the glimpse() function.
library(dplyr)
data_salary %>%
glimpse()
## Rows: 6,599
## Columns: 11
## $ job_title <chr> "Data Engineer", "Data Engineer", "Data Scientist",…
## $ experience_level <chr> "Mid-level", "Mid-level", "Senior-level", "Senior-l…
## $ employment_type <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
## $ work_models <chr> "Remote", "Remote", "Remote", "Remote", "On-site", …
## $ work_year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <chr> "United States", "United States", "United States", …
## $ salary <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "U…
## $ salary_in_usd <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location <chr> "United States", "United States", "United States", …
## $ company_size <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "…
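The aggregation steps later in this report pass na.rm = TRUE to mean() in some calls but not in others, so it can help to know at this stage whether any column contains missing values. The sketch below shows one way to count them. It uses a small hypothetical data frame (toy_salary, with made-up rows), not the real dataset.

```r
# Hypothetical rows with deliberately missing values (not from the dataset)
toy_salary <- data.frame(
  salary_in_usd = c(148100, NA, 98700),
  company_size = c("Medium", "Large", NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_salary))
na_counts

# mean() returns NA unless missing values are dropped explicitly
mean(toy_salary$salary_in_usd)                # NA
mean(toy_salary$salary_in_usd, na.rm = TRUE)  # 123400
```

Running colSums(is.na(data_salary)) on the imported data would give the same kind of per-column count.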
Before the next step, several columns must be converted to the factor type:
data_salary_clean <-
data_salary %>%
mutate_at(.vars = c("job_title", "experience_level", "employment_type",
"work_models", "employee_residence", "salary_currency",
"company_location", "company_size"),
.funs = as.factor)
To double-check that the data types are correct, we run glimpse() again.
data_salary_clean %>%
glimpse()
## Rows: 6,599
## Columns: 11
## $ job_title <fct> Data Engineer, Data Engineer, Data Scientist, Data …
## $ experience_level <fct> Mid-level, Mid-level, Senior-level, Senior-level, M…
## $ employment_type <fct> Full-time, Full-time, Full-time, Full-time, Full-ti…
## $ work_models <fct> Remote, Remote, Remote, Remote, On-site, On-site, O…
## $ work_year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
## $ employee_residence <fct> United States, United States, United States, United…
## $ salary <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ salary_currency <fct> USD, USD, USD, USD, USD, USD, USD, USD, USD, USD, U…
## $ salary_in_usd <int> 148100, 98700, 140032, 100022, 120000, 62100, 25000…
## $ company_location <fct> United States, United States, United States, United…
## $ company_size <fct> Medium, Medium, Medium, Medium, Medium, Medium, Med…
The data type of each column is correct, so the next step is to process this data.
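The conversion above uses mutate_at(), which still works but is marked as superseded in current dplyr documentation. As an alternative sketch, assuming dplyr 1.0.0 or later, the same conversion can be written with across(). The example runs on a small hypothetical data frame (toy_salary) that borrows a few column names from the dataset.

```r
library(dplyr)

# Toy stand-in for data_salary (hypothetical rows, same column names)
toy_salary <- data.frame(
  job_title = c("Data Engineer", "Data Scientist", "Data Engineer"),
  experience_level = c("Mid-level", "Senior-level", "Mid-level"),
  work_year = c(2024L, 2024L, 2023L),
  salary_in_usd = c(148100L, 140032L, 98700L),
  stringsAsFactors = FALSE
)

# Convert every character column to factor; integer columns are left alone
toy_clean <- toy_salary %>%
  mutate(across(where(is.character), as.factor))

str(toy_clean)
```

In this dataset the character columns are exactly the eight columns listed in the mutate_at() call, so selecting with where(is.character) should give the same result without typing each name.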
The next step prepares the data for visualization. Each dataset is prepared to answer a specific business question.
A job seeker in data science wants to know how the average salary changes each year across countries. From the data, we can see the trend in average data scientist salaries: whether there is a significant increase, a decrease, or no change.
The first step is to select only the required columns from the data_salary_clean dataset. Then we calculate the average salary per year (2020-2024) and add a column that holds the label to display on the line plot.
library(dplyr)
library(glue)
library(scales)
# Create average salary by work year
average_salary_year <-
data_salary_clean %>%
select(work_year, salary_in_usd) %>%
group_by(work_year) %>%
summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE)) %>%
mutate(label = glue("{comma(avg_salary)} USD" ))
average_salary_year
## # A tibble: 5 × 3
## work_year avg_salary label
## <int> <dbl> <glue>
## 1 2020 102251. 102,251 USD
## 2 2021 99501. 99,501 USD
## 3 2022 131789. 131,789 USD
## 4 2023 150791. 150,791 USD
## 5 2024 153124. 153,124 USD
The average_salary_year data is ready to be visualized using a line plot.
Jobs in the data field have various job titles with different salary levels. To find the 10 highest-paid combinations of job title and experience level, we select the required columns: job_title, experience_level, and salary_in_usd. Then we calculate the average salary for each combination of job_title and experience_level.
# Prepare highest salary based on job_title & experience_level
salary_job <-
data_salary_clean %>%
select(job_title, experience_level, salary_in_usd) %>%
group_by(job_title, experience_level) %>%
summarise(avg_job = mean(salary_in_usd), .groups = 'drop') %>%
mutate(label = glue("{comma(avg_job)} USD" ))
salary_job
## # A tibble: 279 × 4
## job_title experience_level avg_job label
## <fct> <fct> <dbl> <glue>
## 1 AI Architect Executive-level 215936 215,936.0 USD
## 2 AI Architect Senior-level 233850 233,850.0 USD
## 3 AI Developer Entry-level 110120. 110,119.5 USD
## 4 AI Developer Mid-level 138294. 138,294.3 USD
## 5 AI Developer Senior-level 162771. 162,770.7 USD
## 6 AI Engineer Entry-level 28296. 28,296.5 USD
## 7 AI Engineer Mid-level 152988. 152,988.2 USD
## 8 AI Engineer Senior-level 176706. 176,705.9 USD
## 9 AI Product Manager Senior-level 120000 120,000.0 USD
## 10 AI Programmer Entry-level 56859. 56,858.8 USD
## # ℹ 269 more rows
To find the 10 positions with the highest salary, we use the order() function to sort the salaries from highest to lowest and then select the top 10.
# sort the salaries from highest to lowest with the order() function
salary_job <- salary_job[order(salary_job$avg_job, decreasing = TRUE), ]
top10_job <- head(salary_job,10)
top10_job
## # A tibble: 10 × 4
## job_title experience_level avg_job label
## <fct> <fct> <dbl> <glue>
## 1 Principal Data Scientist Executive-level 416000 416,000.0 USD
## 2 Analytics Engineering Manager Senior-level 399880 399,880.0 USD
## 3 Data Science Tech Lead Senior-level 375000 375,000.0 USD
## 4 Managing Director Data Science Executive-level 280000 280,000.0 USD
## 5 AWS Data Architect Mid-level 258000 258,000.0 USD
## 6 Deep Learning Engineer Senior-level 254706 254,706.0 USD
## 7 Cloud Data Architect Senior-level 250000 250,000.0 USD
## 8 AI Architect Senior-level 233850 233,850.0 USD
## 9 Machine Learning Scientist Mid-level 226131. 226,131.2 USD
## 10 Director of Data Science Executive-level 225244. 225,244.3 USD
The top10_job dataset is ready to be visualized using a bar plot.
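The sort-then-head() pattern used above can also be written in a single dplyr step. The following is a minimal sketch, assuming dplyr 1.0.0 or later for slice_max(). It uses a hypothetical table (toy_job) with made-up titles and averages rather than the real salary_job data.

```r
library(dplyr)

# Hypothetical averages, standing in for salary_job
toy_job <- data.frame(
  job_title = c("A", "B", "C", "D"),
  avg_job = c(120000, 416000, 99000, 250000)
)

# Keep the 2 highest averages, sorted from highest to lowest
top2_job <- toy_job %>%
  slice_max(avg_job, n = 2)

top2_job
```

By default slice_max() keeps ties, so it can return more than n rows when several groups share the same average. Passing with_ties = FALSE forces exactly n rows.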
Every country has different salary standards for jobs in data science, and some countries have high average salaries. To find the countries where companies hire at high salary standards, we start by preparing the data. The first step is to select the columns to be used, which are company_location and salary_in_usd. Then we calculate the average salary for each country.
grouped_loc <-
data_salary_clean %>%
select(company_location, salary_in_usd) %>%
group_by(company_location) %>%
summarise(avg_salary = mean(salary_in_usd)) %>%
mutate(label = glue("{comma(avg_salary)} USD" ))
grouped_loc
## # A tibble: 75 × 3
## company_location avg_salary label
## <fct> <dbl> <glue>
## 1 Algeria 100000 100,000.0 USD
## 2 Andorra 50745 50,745.0 USD
## 3 Argentina 62000 62,000.0 USD
## 4 Armenia 50000 50,000.0 USD
## 5 Australia 114673. 114,673.4 USD
## 6 Austria 71355. 71,354.8 USD
## 7 Bahamas 45555 45,555.0 USD
## 8 Belgium 76865. 76,864.8 USD
## 9 Bosnia and Herzegovina 120000 120,000.0 USD
## 10 Brazil 58569. 58,569.1 USD
## # ℹ 65 more rows
Then we rank the countries and keep the 10 with the highest average salary.
# rank from highest to lowest using the order() function
grouped_loc <- grouped_loc[order(grouped_loc$avg_salary, decreasing = TRUE), ]
top10_loc <- head(grouped_loc,10)
top10_loc
## # A tibble: 10 × 3
## company_location avg_salary label
## <fct> <dbl> <glue>
## 1 Qatar 300000 300,000.0 USD
## 2 Israel 217332 217,332.0 USD
## 3 Puerto Rico 167500 167,500.0 USD
## 4 United States 157073. 157,073.1 USD
## 5 New Zealand 151634. 151,634.3 USD
## 6 Canada 139833. 139,832.8 USD
## 7 Saudi Arabia 134999 134,999.0 USD
## 8 Ukraine 121333. 121,333.3 USD
## 9 Bosnia and Herzegovina 120000 120,000.0 USD
## 10 Australia 114673. 114,673.4 USD
The top10_loc dataset is ready to be visualized using a bar plot.
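The output above does not show how many records lie behind each country's average, and an average built from very few records can be driven by a single salary. As a hedged sketch, the grouped summary could also report a record count with n(). The data frame below (toy_loc) is hypothetical, with made-up country codes and salaries.

```r
library(dplyr)

# Hypothetical records; country names and salaries are made up for illustration
toy_loc <- data.frame(
  company_location = c("X", "Y", "Y", "Y", "Z", "Z"),
  salary_in_usd = c(300000, 150000, 160000, 155000, 90000, 110000)
)

# Average salary and number of records per country, highest average first
loc_summary <- toy_loc %>%
  group_by(company_location) %>%
  summarise(avg_salary = mean(salary_in_usd),
            n_records = n(),
            .groups = "drop") %>%
  arrange(desc(avg_salary))

loc_summary
```

In this toy example the top-ranked country rests on a single record, which is the kind of case a count column makes visible before the ranking is interpreted.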
We want to analyze the average salary given to data scientists based on company size and work experience level. Is there a difference in salary at the same experience level but different company sizes? For this analysis we need three columns: experience_level, company_size, and salary_in_usd. We then calculate the average salary for each combination of experience_level and company_size.
exp_level <-
data_salary_clean %>%
select(experience_level, company_size, salary_in_usd) %>%
group_by(experience_level, company_size) %>%
summarise(avg_salary = mean(salary_in_usd, na.rm = TRUE), .groups = 'drop') %>%
mutate(label = glue("{comma(avg_salary)} USD" ))
exp_level
## # A tibble: 12 × 4
## experience_level company_size avg_salary label
## <fct> <fct> <dbl> <glue>
## 1 Entry-level Large 74603. 74,603 USD
## 2 Entry-level Medium 89223. 89,223 USD
## 3 Entry-level Small 68486. 68,486 USD
## 4 Executive-level Large 187994. 187,994 USD
## 5 Executive-level Medium 190563. 190,563 USD
## 6 Executive-level Small 169172. 169,172 USD
## 7 Mid-level Large 100625. 100,625 USD
## 8 Mid-level Medium 123304. 123,304 USD
## 9 Mid-level Small 73770. 73,770 USD
## 10 Senior-level Large 150557. 150,557 USD
## 11 Senior-level Medium 163532. 163,532 USD
## 12 Senior-level Small 110274. 110,274 USD
The exp_level dataset is ready to be visualized using a multivariate plot.
We want to see the trend in the average data scientist salary per year from 2020 to 2024. A line plot shows how average salaries have changed over the years and helps identify trends and fluctuations. We use work_year on the x-axis and the average salary on the y-axis.
library(ggplot2)
# Create a line plot Salary Trends Over Time (by Work Year)
plot1 <-
ggplot(average_salary_year,
aes(x = work_year,
y = avg_salary)) +
# linewidth replaces size for lines in ggplot2 3.4.0 and later
geom_line(color = "red", linewidth = 1) +
geom_point(color = "black", size = 3) +
labs(title = "Average Salary Trends in 2020-2024",
x = "Year",
y = "Average Salary in USD")+
theme_minimal() +
theme(legend.position = "none")+
geom_text(aes(label = label), size = 3, nudge_x = 0.3 )+
scale_y_continuous(labels = comma)
plot1
Insight:
The average salary in 2020 was 102,251 USD, but in 2021 it decreased to 99,501 USD.
There was a significant increase in the average salary from 2021 to 2022, when it reached 131,789 USD.
Overall, the average salary of data experts tends to increase from year to year, despite the dip in 2021.
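To put numbers on these changes, the year-over-year differences can be computed directly from the yearly averages. The sketch below copies the rounded averages from the average_salary_year output into a small data frame (avg_by_year, a name introduced here for illustration), so the percentages are approximate.

```r
# Yearly averages copied from the average_salary_year output (rounded to whole USD)
avg_by_year <- data.frame(
  work_year = 2020:2024,
  avg_salary = c(102251, 99501, 131789, 150791, 153124)
)

# Absolute and percentage change versus the previous year
avg_by_year$change_usd <- c(NA, diff(avg_by_year$avg_salary))
avg_by_year$change_pct <- round(
  100 * avg_by_year$change_usd / c(NA, head(avg_by_year$avg_salary, -1)), 1
)

avg_by_year
```

On these rounded figures, the average fell by 2,750 USD (about 2.7%) from 2020 to 2021, rose by 32,288 USD (roughly 32%) from 2021 to 2022, and grew by only about 1.5% from 2023 to 2024.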
We want to analyze the highest average salaries by job title and experience level. Do the highest average salaries go only to employees with a high level of experience? To visualize this, I used a bar plot that ranks the highest average salaries by job title and experience level. We use avg_job on the x-axis and job_title on the y-axis.
plot2 <-
ggplot(data = top10_job,
mapping = aes(x = avg_job,
y = reorder(job_title,avg_job),
fill = experience_level))+
geom_col(aes(fill=experience_level))+
labs(title = "Highest Salaries 2020-2024",
subtitle = "Top 10 Job Title with Highest Salaries",
x = "Salary in USD",
y = "Job Title",
fill = "Experience Level") +
theme_minimal() +
theme(legend.position = "right")+
geom_text(aes(label = label), size = 2.5, nudge_x = 10000)+
scale_x_continuous(labels = comma)
plot2
Insight:
The highest average salary is 416,000 USD, for Principal Data Scientists at the highest experience level (executive level).
Among the 10 highest averages, two belong to mid-level positions: AWS Data Architect and Machine Learning Scientist. The highest averages are therefore not limited to the highest experience levels.
To find the countries where companies hire at high salary standards, we rank the 10 countries with the highest average salary. To visualize this, I used a bar plot, which makes the ranking easy to read. We use avg_salary on the x-axis and company_location on the y-axis.
plot3 <-
ggplot(top10_loc, aes(x = avg_salary,
y = reorder(company_location,avg_salary),
fill = avg_salary)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "red", high = "green")+
labs(title = "Highest Salaries 2020-2024",
subtitle = "Top 10 Company Location",
x = "Salary in USD",
y = "Company Location") +
theme_minimal() +
theme(legend.position = "none")+
geom_text(aes(label = label), size = 2.5, nudge_x = 25000)+
scale_x_continuous(labels = comma)
plot3
Insight:
Qatar has the highest average salary, at 300,000.0 USD.
Israel has the second-highest average salary, at 217,332.0 USD.
The gap between Qatar in 1st place and Australia in 10th place is large: 185,326.6 USD.
There are also notable gaps at the top of the ranking: 49,832 USD between Puerto Rico (167,500.0 USD) and Israel (217,332.0 USD), and 82,668 USD between Israel (217,332.0 USD) and Qatar (300,000.0 USD).
We want to analyze the average salary given to data scientists based on company size and work experience level. Is there a difference in salary at the same experience level but different company sizes? To visualize the average salaries, I used a multivariate plot that groups avg_salary by experience_level and company_size.
# Create barplot for salary distribution by experience level & company size
plot4 <-
ggplot(exp_level, aes(x = company_size, y = avg_salary, fill = experience_level)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Company Size",
y = "Salary (USD)",
title = "Salaries Distribution by Company Size and Experience Level in 2020-2024",
fill = "Experience Level") +
theme_minimal()+
scale_y_continuous(labels = comma)+
theme(legend.position = "right")
plot4
Insight:
The average salary for data scientists at each experience level clearly differs depending on company size.
Medium-sized companies in fact pay higher average salaries than large companies at every experience level.
At the highest experience level (executive level), the average salary at a medium company is 190,563 USD, which is higher than the 187,994 USD at a large company. The same holds at the entry, mid, and senior levels.
In other words, a high average salary does not depend on the company being large.
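The comparison between medium and large companies can be checked with a short calculation. The sketch below copies the rounded averages from the exp_level output into a data frame (exp_level_tbl, a name introduced here for illustration) and uses base R's xtabs() to lay them out as a level-by-size table.

```r
# Averages copied from the exp_level output (whole USD)
exp_level_tbl <- data.frame(
  experience_level = rep(c("Entry-level", "Executive-level",
                           "Mid-level", "Senior-level"), each = 3),
  company_size = rep(c("Large", "Medium", "Small"), times = 4),
  avg_salary = c(74603, 89223, 68486,
                 187994, 190563, 169172,
                 100625, 123304, 73770,
                 150557, 163532, 110274)
)

# One row per experience level, one column per company size
tab <- xtabs(avg_salary ~ experience_level + company_size, data = exp_level_tbl)

# Gap between medium and large companies at each experience level
gap <- tab[, "Medium"] - tab[, "Large"]
gap
```

On these figures the gap is positive at all four levels, from 2,569 USD at the executive level to 22,679 USD at mid level. The same table also shows that small companies have the lowest average at every level.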