The data science field is constantly changing, and it’s important for professionals and organizations to understand salary trends. This dataset is focused on Data Science Salaries from 2020 to 2024, aiming to provide insights into salary trends, regional differences, and potential factors impacting compensation within the data science community.
The dataset covers a wide range of data science salary information over the course of five years, from 2020 to 2024, offering a comprehensive view of compensation in the field.
For this LBB (Programming for Data Science) project, I used a dataset of data professionals' salaries from 2020 to 2024 across several countries. The dataset was obtained from https://www.kaggle.com
The first step is to import the dataset using the read.csv() function.
data_salaries <- read.csv("input_data/data_science_salaries.csv")
The data_salaries dataset has 11 variables and 6599 observations. Each column is described below.
Column Name | Description |
---|---|
job_title | The job title or role associated with the reported salary. |
experience_level | The level of experience of the individual. |
employment_type | Indicates whether the employment is full-time, part-time, etc. |
work_models | Describes different working models (remote, on-site, hybrid). |
work_year | The specific year in which the salary information was recorded. |
employee_residence | The residence location of the employee. |
salary | The reported salary in the original currency. |
salary_currency | The currency in which the salary is denominated. |
salary_in_usd | The converted salary in US dollars. |
company_location | The geographic location of the employing organization. |
company_size | The size of the company, categorized by the number of employees. |
The next step is to inspect the imported dataset. To see the first and last rows of data_salaries, we use the head() and tail() functions.
head(data_salaries)
## job_title experience_level employment_type work_models work_year
## 1 Data Engineer Mid-level Full-time Remote 2024
## 2 Data Engineer Mid-level Full-time Remote 2024
## 3 Data Scientist Senior-level Full-time Remote 2024
## 4 Data Scientist Senior-level Full-time Remote 2024
## 5 BI Developer Mid-level Full-time On-site 2024
## 6 BI Developer Mid-level Full-time On-site 2024
## employee_residence salary salary_currency salary_in_usd company_location
## 1 United States 148100 USD 148100 United States
## 2 United States 98700 USD 98700 United States
## 3 United States 140032 USD 140032 United States
## 4 United States 100022 USD 100022 United States
## 5 United States 120000 USD 120000 United States
## 6 United States 62100 USD 62100 United States
## company_size
## 1 Medium
## 2 Medium
## 3 Medium
## 4 Medium
## 5 Medium
## 6 Medium
tail(data_salaries)
## job_title experience_level employment_type work_models
## 6594 Principal Data Scientist Senior-level Full-time Remote
## 6595 Staff Data Analyst Entry-level Contract Hybrid
## 6596 Staff Data Analyst Executive-level Full-time On-site
## 6597 Machine Learning Manager Senior-level Full-time Hybrid
## 6598 Data Engineer Mid-level Full-time Hybrid
## 6599 Data Scientist Senior-level Full-time On-site
## work_year employee_residence salary salary_currency salary_in_usd
## 6594 2020 Germany 130000 EUR 148261
## 6595 2020 Canada 60000 CAD 44753
## 6596 2020 Nigeria 15000 USD 15000
## 6597 2020 Canada 157000 CAD 117104
## 6598 2020 Austria 65000 EUR 74130
## 6599 2020 Austria 80000 EUR 91237
## company_location company_size
## 6594 Germany Medium
## 6595 Canada Large
## 6596 Canada Medium
## 6597 Canada Large
## 6598 Austria Large
## 6599 Austria Small
The data type of each column is checked using the str() function, and summary() gives a first overview of each column.
str(data_salaries)
## 'data.frame': 6599 obs. of 11 variables:
## $ job_title : chr "Data Engineer" "Data Engineer" "Data Scientist" "Data Scientist" ...
## $ experience_level : chr "Mid-level" "Mid-level" "Senior-level" "Senior-level" ...
## $ employment_type : chr "Full-time" "Full-time" "Full-time" "Full-time" ...
## $ work_models : chr "Remote" "Remote" "Remote" "Remote" ...
## $ work_year : int 2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
## $ employee_residence: chr "United States" "United States" "United States" "United States" ...
## $ salary : int 148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
## $ salary_currency : chr "USD" "USD" "USD" "USD" ...
## $ salary_in_usd : int 148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
## $ company_location : chr "United States" "United States" "United States" "United States" ...
## $ company_size : chr "Medium" "Medium" "Medium" "Medium" ...
summary(data_salaries)
## job_title experience_level employment_type work_models
## Length:6599 Length:6599 Length:6599 Length:6599
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## work_year employee_residence salary salary_currency
## Min. :2020 Length:6599 Min. : 14000 Length:6599
## 1st Qu.:2023 Class :character 1st Qu.: 96000 Class :character
## Median :2023 Mode :character Median : 140000 Mode :character
## Mean :2023 Mean : 179283
## 3rd Qu.:2023 3rd Qu.: 187500
## Max. :2024 Max. :30400000
## salary_in_usd company_location company_size
## Min. : 15000 Length:6599 Length:6599
## 1st Qu.: 95000 Class :character Class :character
## Median :138666 Mode :character Mode :character
## Mean :145561
## 3rd Qu.:185000
## Max. :750000
From the structure above, several columns are stored as character even though they hold categorical values. As part of data cleansing, the following columns will be converted to factors.
job_title
experience_level
employment_type
work_models
employee_residence
salary_currency
company_location
company_size
data_salaries$job_title <- as.factor(data_salaries$job_title)
data_salaries$experience_level <- as.factor(data_salaries$experience_level)
data_salaries$employment_type <- as.factor(data_salaries$employment_type)
data_salaries$work_models <- as.factor(data_salaries$work_models)
data_salaries$employee_residence <- as.factor(data_salaries$employee_residence)
data_salaries$salary_currency <- as.factor(data_salaries$salary_currency)
data_salaries$company_location <- as.factor(data_salaries$company_location)
data_salaries$company_size <- as.factor(data_salaries$company_size)
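The eight conversions above all follow the same pattern, so they can also be done in one step. The sketch below is only an illustration on a hypothetical toy data frame (the `toy` object and its values are assumptions, not the real dataset): it finds every character column and converts it to a factor.

```r
# Hypothetical toy data; only the column names mirror the real dataset
toy <- data.frame(
  job_title = c("Data Engineer", "Data Scientist", "Data Engineer"),
  company_size = c("Medium", "Large", "Medium"),
  salary_in_usd = c(148100L, 98700L, 140032L),
  stringsAsFactors = FALSE
)

# Names of all columns currently stored as character
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]

# Convert those columns to factors, leaving numeric columns untouched
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

str(toy)
```

This approach assumes every character column should become a factor, which holds for the columns listed above but may not hold for free-text columns in other datasets.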
The result of the type conversion can be checked again with str() to confirm that no column still has an unsuitable type.
str(data_salaries)
## 'data.frame': 6599 obs. of 11 variables:
## $ job_title : Factor w/ 132 levels "AI Architect",..: 47 47 74 74 20 20 125 125 47 47 ...
## $ experience_level : Factor w/ 4 levels "Entry-level",..: 3 3 4 4 3 3 1 1 2 2 ...
## $ employment_type : Factor w/ 4 levels "Contract","Freelance",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ work_models : Factor w/ 3 levels "Hybrid","On-site",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ work_year : int 2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
## $ employee_residence: Factor w/ 87 levels "Algeria","American Samoa",..: 85 85 85 85 85 85 85 85 85 85 ...
## $ salary : int 148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
## $ salary_currency : Factor w/ 22 levels "AUD","BRL","CAD",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ salary_in_usd : int 148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
## $ company_location : Factor w/ 75 levels "Algeria","Andorra",..: 74 74 74 74 74 74 74 74 74 74 ...
## $ company_size : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
Checking for missing values tells us whether the dataset is complete. If there are missing values, they must be handled appropriately to ensure the accuracy and reliability of the analysis.
To find out whether the dataset has any missing values, use the anyNA() function.
anyNA(data_salaries)
## [1] FALSE
To count the missing values in each column, combine the is.na() and colSums() functions.
colSums(is.na(data_salaries))
## job_title experience_level employment_type work_models
## 0 0 0 0
## work_year employee_residence salary salary_currency
## 0 0 0 0
## salary_in_usd company_location company_size
## 0 0 0
Insight: the checks above show that this dataset has no missing values, so we can proceed to the next stage of analysis.
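Because the real dataset has no missing values, the checks above only ever return zeros. As a rough illustration of what they report when values are missing, here is a sketch on a hypothetical toy data frame (all values are made up), including one possible way to handle the gaps by keeping only complete rows.

```r
# Hypothetical toy data with two deliberately missing values
toy <- data.frame(
  salary_in_usd = c(148100, NA, 140032),
  company_size = c("Medium", "Large", NA),
  work_year = c(2024, 2023, 2022)
)

# TRUE if at least one value anywhere in the data frame is missing
has_na <- anyNA(toy)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

# One possible handling strategy: keep only rows without any missing value
complete_rows <- toy[complete.cases(toy), ]

print(has_na)
print(na_counts)
print(nrow(complete_rows))
```

Dropping rows is only one option. Whether to drop or impute depends on how many values are missing and why.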
summary(data_salaries)
## job_title experience_level employment_type
## Data Engineer :1307 Entry-level : 565 Contract : 19
## Data Scientist :1243 Executive-level: 254 Freelance: 12
## Data Analyst : 910 Mid-level :1675 Full-time:6552
## Machine Learning Engineer: 629 Senior-level :4105 Part-time: 16
## Analytics Engineer : 246
## Research Scientist : 206
## (Other) :2058
## work_models work_year employee_residence salary
## Hybrid : 225 Min. :2020 United States :5305 Min. : 14000
## On-site:3813 1st Qu.:2023 United Kingdom: 401 1st Qu.: 96000
## Remote :2561 Median :2023 Canada : 241 Median : 140000
## Mean :2023 Germany : 71 Mean : 179283
## 3rd Qu.:2023 India : 70 3rd Qu.: 187500
## Max. :2024 Spain : 67 Max. :30400000
## (Other) : 444
## salary_currency salary_in_usd company_location company_size
## USD :5827 Min. : 15000 United States :5354 Large : 569
## GBP : 334 1st Qu.: 95000 United Kingdom: 408 Medium:5860
## EUR : 292 Median :138666 Canada : 243 Small : 170
## INR : 51 Mean :145561 Germany : 78
## CAD : 39 3rd Qu.:185000 Spain : 63
## AUD : 11 Max. :750000 India : 58
## (Other): 45 (Other) : 395
Here is the information we can use for further analysis:
The data covers the years 2020 to 2024 only.
There are 4 levels of experience: Senior-level (SE), Entry-level (EN), Executive-level (EX), and Mid-level (MI). The largest group is senior-level with 4105 workers, and the smallest is executive-level with 254 workers.
There are 4 types of employment: full-time (FT), contract (CT), part-time (PT), and freelance (FL). Full-time workers are the largest group at 6552, and freelance workers are the smallest at 12.
There are 132 job titles, including Data Engineer, Data Scientist, Data Analyst, Machine Learning Engineer, Analytics Engineer, Research Scientist, and many others.
Across all workers, the salary column has a minimum of 14,000, a mean of 179,283, and a maximum of 30,400,000. However, these values are in different currencies, so they are not directly comparable.
The most widely used currencies for salary payments are USD, GBP, EUR, INR, CAD, and AUD.
After conversion to USD (salary_in_usd), salaries have a minimum of 15,000, a mean of 145,561, and a maximum of 750,000.
Most employees live in 6 countries: United States (US), United Kingdom (GB), Canada (CA), Germany (DE), India (IN), and Spain (ES). All other countries together account for 444 employees.
Company size is based on the average number of people who worked for the company during the year: S for fewer than 50 employees (small), M for 50 to 250 employees (medium), and L for more than 250 employees (large). Medium-sized companies are the most common at 5860, and small companies are the least common at 170.
The employer's main office or contracting branch is most often located in the United States (US), United Kingdom (GB), Canada (CA), Germany (DE), Spain (ES), and India (IN).
Of the three working models (remote, on-site, hybrid), on-site is the most common (3,813) and hybrid is the least common (225).
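The counts in the list above are easier to compare as percentages. The sketch below takes the experience-level counts from the summary() output and converts them to shares of the 6599 observations. The vector is typed in by hand here for illustration; on the real data frame, the same result could presumably be obtained with prop.table(table(data_salaries$experience_level)).

```r
# Experience-level counts copied from the summary() output above
experience_counts <- c(
  "Entry-level" = 565,
  "Executive-level" = 254,
  "Mid-level" = 1675,
  "Senior-level" = 4105
)

# Total number of observations across all levels
total <- sum(experience_counts)

# Share of each level as a percentage, rounded to one decimal place
shares <- round(100 * experience_counts / total, 1)

print(total)
print(shares)
```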
There are 3 things in descriptive statistics that can be explored:
1. Business Question
The goal is to understand how the average salary has evolved over the years and whether there are any noticeable patterns or fluctuations. Analyzing these trends can provide insight into compensation strategies, employee retention, and overall workforce satisfaction. Based on this business question, we want to know the trend of the average salary for jobs in the data sector from 2020 to 2024.
Descriptive statistics summarize and describe the main features of the data. We’ll use these to understand the central tendency, variability, and distribution of salaries over different work years.
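One way to look at salaries "over different work years" is to compute the mean salary per year. The sketch below uses base R's aggregate() on a hypothetical toy data frame. The figures are made up for illustration and are not taken from the real dataset.

```r
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  work_year = c(2020, 2020, 2021, 2021, 2022, 2022),
  salary_in_usd = c(80000, 100000, 95000, 105000, 120000, 140000)
)

# Mean salary in USD for each work year
yearly_mean <- aggregate(salary_in_usd ~ work_year, data = toy, FUN = mean)

print(yearly_mean)
```

Applied to data_salaries, the same call would give one mean per year from 2020 to 2024, which could then be compared or plotted to see the trend.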
Steps using descriptive statistics:
1. Measure of Central Tendency
Calculate the mean of salary_in_usd (appropriate when there are no outliers) and the median of salary_in_usd (more robust when there are outliers).
# Calculate mean value from salary in USD
mean(data_salaries$salary_in_usd)
## [1] 145560.6
# Calculate median value from salary in USD
median(data_salaries$salary_in_usd)
## [1] 138666
Insight: The mean (145,560.6) is higher than the median (138,666). This gap suggests that the salary data in USD contains high-value outliers, which pull the mean upward while leaving the median largely unaffected. Another way to check for outliers in the data is to create a boxplot.
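To see why an outlier separates the mean from the median, consider a small made-up example. The vectors below are hypothetical. Only the extreme value 750000 echoes the maximum seen in the real data.

```r
# Hypothetical salaries without an extreme value
clean <- c(90000, 100000, 110000, 120000, 130000)

# The same salaries plus one extreme value
with_outlier <- c(clean, 750000)

# Without the outlier, mean and median coincide
print(mean(clean))
print(median(clean))

# With the outlier, the mean jumps while the median barely moves
print(mean(with_outlier))
print(median(with_outlier))
```

In this toy case the median only shifts from 110000 to 115000, while the mean rises above 216000, which is the same pattern (mean above median) seen in salary_in_usd.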
2. Measure of Spread
To describe how far values deviate from the mean, the sd() function can be used.
sd(data_salaries$salary_in_usd)
## [1] 70946.84
Insight: Values within one standard deviation of the mean (mean +/- standard deviation) can be treated as the typical range.
So we can calculate the lower limit and upper limit from the mean value.
# Calculate the lower limit and upper limit of the salary_in_USD value
lower = 145560.6 - 70946.84
lower
## [1] 74613.76
upper = 145560.6 + 70946.84
upper
## [1] 216507.4
Insight:
Typical salaries (within one standard deviation of the mean) range from 74,613.76 USD to 216,507.4 USD.
We can create box plots to visualize salary distributions.
Box Plot
A boxplot is another way to check for outliers in the data.
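The limits above were typed in from printed results. As a sketch of an alternative, the same limits can be computed directly from the data so they stay in sync if the data changes. The vector `x` below is a hypothetical stand-in for data_salaries$salary_in_usd.

```r
# Hypothetical stand-in for data_salaries$salary_in_usd
x <- c(60000, 95000, 138000, 185000, 250000)

# Lower and upper limits: one standard deviation either side of the mean
lower <- mean(x) - sd(x)
upper <- mean(x) + sd(x)

# Share of observations that fall inside the limits
share_inside <- mean(x >= lower & x <= upper)

print(c(lower = lower, upper = upper))
print(share_inside)
```

For roughly normal data about 68% of values fall inside these limits. Salary data is skewed, so the share on the real dataset may differ.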
boxplot(data_salaries$salary_in_usd, horizontal = TRUE)
summary(data_salaries$salary_in_usd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15000 95000 138666 145561 185000 750000
IQR(data_salaries$salary_in_usd)
## [1] 90000
Insight from the box plot above :
IQR = 90,000. This means the central 50% of the data spans a range of 90,000 USD, from the first quartile (95,000) to the third quartile (185,000).
Since the mean is greater than the median, the distribution tends to be right-skewed. This can happen because of high outliers (maximum value 750,000).
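A common convention, and the default used by boxplot() for its whiskers, is to flag points more than 1.5 times the IQR beyond the quartiles as outliers. The worked example below applies that rule to the quartiles reported by summary() above. Note that boxplot() computes its hinges slightly differently from quantile(), so the exact fences on the plot may differ a little.

```r
# Quartiles taken from the summary() output above
q1 <- 95000
q3 <- 185000
iqr <- q3 - q1

# Outlier fences under the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The maximum salary reported above, checked against the upper fence
max_salary <- 750000
is_outlier <- max_salary > upper_fence

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(is_outlier)
```

Under this rule any salary above 320,000 USD counts as an outlier, so the maximum of 750,000 USD is well outside the fence. The lower fence is negative, so no salary can be a low outlier.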
Inferential statistics allow us to make inferences or predictions about a population based on a sample. We’ll use these to draw conclusions about salary trends over time.
Analysis from business questions
Based on the Data Science Salaries dataset from 2020 to 2024, there are 4105 senior-level workers with an average salary over 2020-2024 of 162,071 USD. Meanwhile, the average salary of all workers is 145,561 USD with a standard deviation of 70,946.84 USD.
A newly established IT company in Indonesia wants to know whether the salary of senior-level workers increased significantly from 2020 to 2024, using a 95% confidence level.
Answer
H0 : There is no increase in senior-level salaries from 2020-2024
H1 : There is a significant increase in senior-level salaries from 2020-2024
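Find: the standard error (SE) and the Z value, comparing the senior-level mean salary (162,071 USD) with the overall mean salary (145,561 USD).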
library(dplyr)
# Calculate the sample mean: average salary of senior-level workers
data_salary_senior <-
data_salaries %>%
filter(experience_level == "Senior-level")
summary(data_salary_senior$salary_in_usd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15809 118000 153600 162071 199200 750000
head(data_salary_senior)
## job_title experience_level employment_type work_models work_year
## 1 Data Scientist Senior-level Full-time Remote 2024
## 2 Data Scientist Senior-level Full-time Remote 2024
## 3 Data Engineer Senior-level Full-time On-site 2024
## 4 Data Engineer Senior-level Full-time On-site 2024
## 5 MLOps Engineer Senior-level Full-time On-site 2024
## 6 MLOps Engineer Senior-level Full-time On-site 2024
## employee_residence salary salary_currency salary_in_usd company_location
## 1 United States 140032 USD 140032 United States
## 2 United States 100022 USD 100022 United States
## 3 United States 204662 USD 204662 United States
## 4 United States 184662 USD 184662 United States
## 5 United States 175000 USD 175000 United States
## 6 United States 110000 USD 110000 United States
## company_size
## 1 Medium
## 2 Medium
## 3 Medium
## 4 Medium
## 5 Medium
## 6 Medium
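# Standard error: overall standard deviation divided by sqrt(sample size)
SE <- 70946.84/sqrt(4105)
# Z statistic: (sample mean - overall mean) / SE
Z <- (162071-145561)/SE
Z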
## [1] 14.90976
library(DescTools)
# Upper-tail p-value from the standard normal distribution
p_value <- pnorm(Z, lower.tail = FALSE)
p_value
## [1] 1.423912e-50
Insight:
If \(p\text{-value} < \alpha\), then reject \(H_0\) and accept \(H_1\)
If \(p\text{-value} > \alpha\), then fail to reject \(H_0\)
Result:
alpha: 0.05
p-value: 1.423912e-50
alpha > p-value
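Conclusion: reject H0 and accept H1. At the 95% confidence level, the senior-level mean salary (162,071 USD) is significantly higher than the overall mean (145,561 USD), which this analysis takes as evidence of a significant increase in senior-level salaries from 2020-2024.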
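The Z-test steps above can be collected into a small helper so the inputs are explicit. This is only a sketch: the function name z_test_upper and its arguments are hypothetical, and it assumes a one-sided (upper-tail) test with a known standard deviation, matching the calculation above.

```r
# Hypothetical helper for a one-sided (upper-tail) Z-test
z_test_upper <- function(sample_mean, pop_mean, pop_sd, n, alpha = 0.05) {
  se <- pop_sd / sqrt(n)                      # standard error of the mean
  z <- (sample_mean - pop_mean) / se          # Z statistic
  p_value <- pnorm(z, lower.tail = FALSE)     # upper-tail p-value
  z_crit <- qnorm(1 - alpha)                  # critical value for alpha
  list(se = se, z = z, p_value = p_value,
       z_crit = z_crit, reject_h0 = p_value < alpha)
}

# Inputs taken from the analysis above
result <- z_test_upper(
  sample_mean = 162071,   # mean salary of senior-level workers
  pop_mean = 145561,      # mean salary of all workers
  pop_sd = 70946.84,      # standard deviation of all salaries
  n = 4105                # number of senior-level workers
)

print(result$z)
print(result$reject_h0)
```

Comparing the Z value with the critical value (about 1.645 for alpha = 0.05) leads to the same decision as comparing the p-value with alpha.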