Introduction

The data science field is constantly changing, and it’s important for professionals and organizations to understand salary trends. This dataset is focused on Data Science Salaries from 2020 to 2024, aiming to provide insights into salary trends, regional differences, and potential factors impacting compensation within the data science community.

The dataset covers a wide range of data science salary information over the course of five years, from 2020 to 2024, offering a comprehensive view of compensation in the field.

1. Import & Read Data

For this LBB (Programming for Data Science) project, I used data on the salaries of data professionals from 2020 to 2024 across several countries. The dataset was obtained from https://www.kaggle.com

The first step is to import the dataset using the read.csv() function.

data_salaries <- read.csv("input_data/data_science_salaries.csv")

The data_salaries dataset has 11 variables and 6599 observations. Each column is described below.

Column Name          Description
job_title            The job title or role associated with the reported salary.
experience_level     The level of experience of the individual.
employment_type      Indicates whether the employment is full-time, part-time, etc.
work_models          Describes different working models (remote, on-site, hybrid).
work_year            The specific year in which the salary information was recorded.
employee_residence   The residence location of the employee.
salary               The reported salary in the original currency.
salary_currency      The currency in which the salary is denominated.
salary_in_usd        The converted salary in US dollars.
company_location     The geographic location of the employing organization.
company_size         The size of the company, categorized by the number of employees.
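
The reported dimensions can be confirmed right after import. The following is a minimal, self-contained sketch rather than part of the original analysis: it writes a tiny sample CSV with only three of the eleven columns to a temporary file (the names sample_csv and sample_data are hypothetical) and inspects its shape. With the real file, the same calls on data_salaries should report 6599 rows and 11 columns.

```r
# Tiny stand-in for the real CSV, written to a temporary file for illustration
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "job_title,work_year,salary_in_usd",
  "Data Engineer,2024,148100",
  "Data Scientist,2024,140032",
  "BI Developer,2024,120000"
), sample_csv)

sample_data <- read.csv(sample_csv)

dim(sample_data)    # number of rows, then number of columns
names(sample_data)  # column names taken from the header line
```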

2. Inspect Data

a. Observation data

The next step is to inspect the imported dataset. To view the first and last rows of data_salaries, we use the head() and tail() functions.

head() function
head(data_salaries)
##        job_title experience_level employment_type work_models work_year
## 1  Data Engineer        Mid-level       Full-time      Remote      2024
## 2  Data Engineer        Mid-level       Full-time      Remote      2024
## 3 Data Scientist     Senior-level       Full-time      Remote      2024
## 4 Data Scientist     Senior-level       Full-time      Remote      2024
## 5   BI Developer        Mid-level       Full-time     On-site      2024
## 6   BI Developer        Mid-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 148100             USD        148100    United States
## 2      United States  98700             USD         98700    United States
## 3      United States 140032             USD        140032    United States
## 4      United States 100022             USD        100022    United States
## 5      United States 120000             USD        120000    United States
## 6      United States  62100             USD         62100    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium

tail() function
tail(data_salaries)
##                     job_title experience_level employment_type work_models
## 6594 Principal Data Scientist     Senior-level       Full-time      Remote
## 6595       Staff Data Analyst      Entry-level        Contract      Hybrid
## 6596       Staff Data Analyst  Executive-level       Full-time     On-site
## 6597 Machine Learning Manager     Senior-level       Full-time      Hybrid
## 6598            Data Engineer        Mid-level       Full-time      Hybrid
## 6599           Data Scientist     Senior-level       Full-time     On-site
##      work_year employee_residence salary salary_currency salary_in_usd
## 6594      2020            Germany 130000             EUR        148261
## 6595      2020             Canada  60000             CAD         44753
## 6596      2020            Nigeria  15000             USD         15000
## 6597      2020             Canada 157000             CAD        117104
## 6598      2020            Austria  65000             EUR         74130
## 6599      2020            Austria  80000             EUR         91237
##      company_location company_size
## 6594          Germany       Medium
## 6595           Canada        Large
## 6596           Canada       Medium
## 6597           Canada        Large
## 6598          Austria        Large
## 6599          Austria        Small

b. Data Structure

The data structure of each column is checked using the str() function.

str(data_salaries)
## 'data.frame':    6599 obs. of  11 variables:
##  $ job_title         : chr  "Data Engineer" "Data Engineer" "Data Scientist" "Data Scientist" ...
##  $ experience_level  : chr  "Mid-level" "Mid-level" "Senior-level" "Senior-level" ...
##  $ employment_type   : chr  "Full-time" "Full-time" "Full-time" "Full-time" ...
##  $ work_models       : chr  "Remote" "Remote" "Remote" "Remote" ...
##  $ work_year         : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
##  $ employee_residence: chr  "United States" "United States" "United States" "United States" ...
##  $ salary            : int  148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
##  $ salary_currency   : chr  "USD" "USD" "USD" "USD" ...
##  $ salary_in_usd     : int  148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
##  $ company_location  : chr  "United States" "United States" "United States" "United States" ...
##  $ company_size      : chr  "Medium" "Medium" "Medium" "Medium" ...

c. Summary Data

summary(data_salaries)
##   job_title         experience_level   employment_type    work_models       
##  Length:6599        Length:6599        Length:6599        Length:6599       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    work_year    employee_residence     salary         salary_currency   
##  Min.   :2020   Length:6599        Min.   :   14000   Length:6599       
##  1st Qu.:2023   Class :character   1st Qu.:   96000   Class :character  
##  Median :2023   Mode  :character   Median :  140000   Mode  :character  
##  Mean   :2023                      Mean   :  179283                     
##  3rd Qu.:2023                      3rd Qu.:  187500                     
##  Max.   :2024                      Max.   :30400000                     
##  salary_in_usd    company_location   company_size      
##  Min.   : 15000   Length:6599        Length:6599       
##  1st Qu.: 95000   Class :character   Class :character  
##  Median :138666   Mode  :character   Mode  :character  
##  Mean   :145561                                        
##  3rd Qu.:185000                                        
##  Max.   :750000

3. Data Cleansing

a. Change The Data Type (Explicit Coercion)

Based on the data structure above, several columns do not yet have a suitable data type. The following columns will be converted to the factor type.

  1. job_title
  2. experience_level
  3. employment_type
  4. work_models
  5. employee_residence
  6. salary_currency
  7. company_location
  8. company_size

data_salaries$job_title <- as.factor(data_salaries$job_title)
data_salaries$experience_level <- as.factor(data_salaries$experience_level)
data_salaries$employment_type <- as.factor(data_salaries$employment_type)
data_salaries$work_models <- as.factor(data_salaries$work_models)
data_salaries$employee_residence <- as.factor(data_salaries$employee_residence)
data_salaries$salary_currency <- as.factor(data_salaries$salary_currency)
data_salaries$company_location <- as.factor(data_salaries$company_location)
data_salaries$company_size <- as.factor(data_salaries$company_size)

The result of the column type adjustment can be checked again, in case any columns still have an unsuitable type.

str(data_salaries)
## 'data.frame':    6599 obs. of  11 variables:
##  $ job_title         : Factor w/ 132 levels "AI Architect",..: 47 47 74 74 20 20 125 125 47 47 ...
##  $ experience_level  : Factor w/ 4 levels "Entry-level",..: 3 3 4 4 3 3 1 1 2 2 ...
##  $ employment_type   : Factor w/ 4 levels "Contract","Freelance",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ work_models       : Factor w/ 3 levels "Hybrid","On-site",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ work_year         : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
##  $ employee_residence: Factor w/ 87 levels "Algeria","American Samoa",..: 85 85 85 85 85 85 85 85 85 85 ...
##  $ salary            : int  148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
##  $ salary_currency   : Factor w/ 22 levels "AUD","BRL","CAD",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ salary_in_usd     : int  148100 98700 140032 100022 120000 62100 250000 150000 219650 136000 ...
##  $ company_location  : Factor w/ 75 levels "Algeria","Andorra",..: 74 74 74 74 74 74 74 74 74 74 ...
##  $ company_size      : Factor w/ 3 levels "Large","Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
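
The eight conversions can also be written as a single step. This is a minimal sketch under stated assumptions, not the method used above: toy is a hypothetical miniature data frame standing in for data_salaries, and factor_cols is an illustrative name for the vector of columns to convert.

```r
# Hypothetical miniature data frame standing in for data_salaries
toy <- data.frame(
  job_title        = c("Data Engineer", "Data Scientist", "Data Engineer"),
  experience_level = c("Mid-level", "Senior-level", "Mid-level"),
  salary_in_usd    = c(148100L, 140032L, 98700L),
  stringsAsFactors = FALSE
)

# Names of the columns that should become factors
factor_cols <- c("job_title", "experience_level")

# lapply() applies as.factor() to each selected column in one step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Keeping the column names in one vector makes it easy to add or remove a column later without repeating the assignment line.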

b. Check Missing Value

The purpose of checking for missing values is to know whether the dataset is free of them. If there are missing values, they should be handled appropriately to ensure the accuracy and reliability of the analysis.

To find out whether the dataset has any missing values, use the anyNA() function.

anyNA(data_salaries)
## [1] FALSE

To check for missing values and count them in columns, you can use the is.na() and colSums() functions.

colSums(is.na(data_salaries))
##          job_title   experience_level    employment_type        work_models 
##                  0                  0                  0                  0 
##          work_year employee_residence             salary    salary_currency 
##                  0                  0                  0                  0 
##      salary_in_usd   company_location       company_size 
##                  0                  0                  0

Insight : the checks above show that this dataset has no missing values, so we can proceed to the next stage of analysis.
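
Since this dataset is complete, no handling was needed here. For reference, the sketch below shows two common ways missing values could be handled if they did appear. The data frame toy and its values are invented for illustration, and which option is appropriate depends on the analysis.

```r
# Hypothetical data with two missing salaries, for illustration only
toy <- data.frame(
  work_year     = c(2022L, 2023L, 2023L, 2024L, 2024L),
  salary_in_usd = c(90000, NA, 120000, NA, 150000)
)

colSums(is.na(toy))  # count missing values per column

# Option 1: drop every row that contains a missing value
toy_complete <- na.omit(toy)

# Option 2: replace missing salaries with the median of the observed salaries
toy_imputed <- toy
observed_median <- median(toy$salary_in_usd, na.rm = TRUE)
toy_imputed$salary_in_usd[is.na(toy_imputed$salary_in_usd)] <- observed_median
```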

4. Data Preparation

summary(data_salaries)
##                      job_title           experience_level  employment_type
##  Data Engineer            :1307   Entry-level    : 565    Contract :  19  
##  Data Scientist           :1243   Executive-level: 254    Freelance:  12  
##  Data Analyst             : 910   Mid-level      :1675    Full-time:6552  
##  Machine Learning Engineer: 629   Senior-level   :4105    Part-time:  16  
##  Analytics Engineer       : 246                                           
##  Research Scientist       : 206                                           
##  (Other)                  :2058                                           
##   work_models     work_year         employee_residence     salary        
##  Hybrid : 225   Min.   :2020   United States :5305     Min.   :   14000  
##  On-site:3813   1st Qu.:2023   United Kingdom: 401     1st Qu.:   96000  
##  Remote :2561   Median :2023   Canada        : 241     Median :  140000  
##                 Mean   :2023   Germany       :  71     Mean   :  179283  
##                 3rd Qu.:2023   India         :  70     3rd Qu.:  187500  
##                 Max.   :2024   Spain         :  67     Max.   :30400000  
##                                (Other)       : 444                       
##  salary_currency salary_in_usd          company_location company_size 
##  USD    :5827    Min.   : 15000   United States :5354    Large : 569  
##  GBP    : 334    1st Qu.: 95000   United Kingdom: 408    Medium:5860  
##  EUR    : 292    Median :138666   Canada        : 243    Small : 170  
##  INR    :  51    Mean   :145561   Germany       :  78                 
##  CAD    :  39    3rd Qu.:185000   Spain         :  63                 
##  AUD    :  11    Max.   :750000   India         :  58                 
##  (Other):  45                     (Other)       : 395

Here is the information we can use for further analysis:

  1. This data only provides information from 2020 to 2024

  2. There are 4 levels of experience: Senior-level (SE), Entry-level (EN), Executive-level (EX), and Mid-level (MI). The senior level has the most workers (4,105) and the executive level the fewest (254).

  3. There are 4 types of employment: full-time (FT), contract (CT), part-time (PT), and freelance (FL). Full-time workers are the largest group with 6,552 workers and freelance workers the smallest with 12.

  4. There are 132 distinct job titles, including Data Engineer, Data Scientist, Data Analyst, Machine Learning Engineer, Research Scientist, Analytics Engineer and many others.

  5. Across all workers, the salary column has a minimum of 14,000, a mean of 179,283 and a maximum of 30,400,000. However, these values are in different currencies, so they are not directly comparable.

  6. The most widely used currencies for salary payments are USD, GBP, EUR, INR, CAD and AUD.

  7. Salaries converted to USD (salary_in_usd) have a minimum of 15,000, a mean of 145,561 and a maximum of 750,000.

  8. Most employees live in these 6 countries: United States (US), United Kingdom (GB), Canada (CA), Germany (DE), India (IN) and Spain (ES).

  9. Company size reflects the average number of people who worked for the company during the year; S : fewer than 50 employees (small), M : 50 to 250 employees (medium), L : more than 250 employees (large). Medium-sized companies are the most common (5,860) and small companies the least common (170).

  10. The employer’s main office or contracting branch is most often located in the United States (US), United Kingdom (GB), Canada (CA), Germany (DE) and Spain (ES).

  11. Of the three working models (remote, on-site, hybrid), on-site is the most common (3,813) and hybrid the least common (225).
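
Counts like those in the list above can be turned into shares with table() and prop.table(). The sketch below is illustrative only: it rebuilds an experience_level vector from the four counts printed in the summary instead of reading the real column, so the object names are assumptions.

```r
# Rebuild the experience_level column from the counts in the summary output
experience_level <- factor(rep(
  c("Entry-level", "Executive-level", "Mid-level", "Senior-level"),
  times = c(565, 254, 1675, 4105)
))

counts <- table(experience_level)

sort(counts, decreasing = TRUE)     # largest group first
round(prop.table(counts) * 100, 1)  # share of each level in percent
```

On the real data, the same two calls applied to data_salaries$experience_level would give the shares directly.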

5. Exploratory Data Analysis (EDA)

Three aspects of descriptive statistics can be explored:

  • Measure of Central Tendency
  • Measure of Spread
  • Variable Relationship

a. Descriptive Statistics:

1. Business Question

We want to understand how the average salary has evolved over the years and whether there are any noticeable patterns or fluctuations. Analyzing these trends can provide insight into compensation strategies, employee retention, and overall workforce satisfaction. Based on this business question, we want to know the trend of the average salary for jobs in the data sector from 2020 to 2024.

Descriptive statistics summarize and describe the main features of the data. We’ll use these to understand the central tendency, variability, and distribution of salaries over different work years.
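
One way to look at the trend named in the business question is to average salaries within each work year. The sketch below is a hedged illustration with invented numbers: toy and its values are hypothetical and do not come from the dataset, but the same aggregate() call could be applied to data_salaries.

```r
# Hypothetical salaries, invented for illustration (not the real dataset)
toy <- data.frame(
  work_year     = c(2020, 2020, 2021, 2021, 2022, 2022),
  salary_in_usd = c(80000, 100000, 95000, 105000, 120000, 140000)
)

# Mean salary for each work year
yearly_mean <- aggregate(salary_in_usd ~ work_year, data = toy, FUN = mean)
yearly_mean

# Median salary for each work year, less sensitive to extreme values
yearly_median <- aggregate(salary_in_usd ~ work_year, data = toy, FUN = median)
yearly_median
```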

Steps using descriptive statistics:

1. Measure of Central Tendency

Calculate the mean of salary_in_usd (appropriate when there are no outliers) and the median of salary_in_usd (more robust when there are outliers).

# Calculate mean value from salary in USD
mean(data_salaries$salary_in_usd)
## [1] 145560.6
# Calculate median value from salary in USD
median(data_salaries$salary_in_usd)
## [1] 138666

Insight : The mean (145,560.6) is higher than the median (138,666). A gap in this direction suggests that the salary data in USD contains high outliers that pull the mean upward. The box plot later in this section offers another way to check for outliers.
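
The effect of a single extreme value on the two measures can be seen in a small example. The salaries below are hypothetical, chosen only so the arithmetic is easy to follow; the added value of 750,000 mirrors the maximum reported in the summary.

```r
# Hypothetical salaries without any extreme value
salaries <- c(90000, 100000, 110000, 120000, 130000)
mean(salaries)    # 110000
median(salaries)  # 110000

# Add one very high salary and recompute
salaries_outlier <- c(salaries, 750000)
mean(salaries_outlier)    # about 216667, pulled far upward
median(salaries_outlier)  # 115000, barely moved
```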

2. Measure of Spread

To describe the deviation of the value from the mean point, the sd() function can be used

sd(data_salaries$salary_in_usd)
## [1] 70946.84

Insight : Values within one standard deviation of the mean (mean +/- standard deviation) can be treated as the typical range.

So we can calculate the lower limit and upper limit from the mean value

# Calculate the lower limit and upper limit of the salary_in_USD value
lower = 145560.6 - 70946.84
lower
## [1] 74613.76
upper = 145560.6 + 70946.84
upper
## [1] 216507.4

Insight :

  1. Typical salaries, within one standard deviation of the mean, range from 74,613.76 USD to 216,507.4 USD

  2. We can create box plots to visualize salary distributions.
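
The limits can also be computed from the data instead of typing the rounded numbers by hand, which avoids copy errors. This is a minimal sketch on a hypothetical salary vector; on the real data, salaries would be replaced by data_salaries$salary_in_usd.

```r
# Hypothetical salaries in USD, for illustration only
salaries <- c(60000, 95000, 120000, 140000, 160000, 185000, 300000)

m <- mean(salaries)
s <- sd(salaries)

lower <- m - s  # lower limit: one standard deviation below the mean
upper <- m + s  # upper limit: one standard deviation above the mean

# Share of observations that fall inside the limits
within <- salaries >= lower & salaries <= upper
mean(within)
```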

Box Plot

Another way to check the presence of outliers in the data is to create a boxplot.

boxplot(data_salaries$salary_in_usd, horizontal = TRUE)

summary(data_salaries$salary_in_usd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15000   95000  138666  145561  185000  750000
IQR(data_salaries$salary_in_usd)
## [1] 90000

Insight from the box plot above :

  1. IQR = 90,000, which means the central 50% of salaries span a range of 90,000 USD, from 95,000 USD (1st quartile) to 185,000 USD (3rd quartile)

  2. Since the mean is greater than the median, the distribution tends to be right-skewed. This could happen because of the presence of outliers (max value 750,000)
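
A common convention for flagging outliers is the 1.5 * IQR rule, which is essentially what boxplot() uses by default to draw its whiskers. The sketch below applies that rule to the quartiles reported by summary() above; the variable names are illustrative.

```r
# Quartiles reported by summary(data_salaries$salary_in_usd)
q1 <- 95000
q3 <- 185000
iqr <- q3 - q1                 # 90000, matching the IQR() output

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence <- q1 - 1.5 * iqr  # -40000
upper_fence <- q3 + 1.5 * iqr  # 320000

# The reported maximum of 750,000 USD lies above the upper fence
750000 > upper_fence           # TRUE
```

Because the lower fence is negative, no salary can be a low outlier under this rule; all flagged points are on the high side, which matches the right skew noted above.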

b. Inferential Statistics:

Inferential statistics allow us to make inferences or predictions about a population based on a sample. We’ll use these to draw conclusions about salary trends over time.

Analysis from business questions

Based on the Data Science Salaries dataset from 2020 to 2024, there are 4,105 senior-level workers with an average salary over 2020-2024 of 162,071 USD. Meanwhile, the average salary of all workers is 145,561 USD with a standard deviation of 70,946.84 USD.

A newly established IT company in Indonesia wants to know whether the salary of senior-level workers increased significantly from 2020 to 2024, using a 95% confidence level.

Answer

H0 : There is no increase in senior-level salaries from 2020-2024

H1 : There is a significant increase in senior-level salaries from 2020-2024

  • Population mean : 145561
  • Population SD : 70946.84
  • Population size (N) : 6599
  • n (senior level) : 4105
  • Sample mean : 162071
  • Confidence level : 95%

Find: Z value and SE

library(dplyr)
#calculate mean sample = average salary of senior level
data_salary_senior <-
  data_salaries %>%
  filter(experience_level == "Senior-level")
summary(data_salary_senior$salary_in_usd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15809  118000  153600  162071  199200  750000
head(data_salary_senior)
##        job_title experience_level employment_type work_models work_year
## 1 Data Scientist     Senior-level       Full-time      Remote      2024
## 2 Data Scientist     Senior-level       Full-time      Remote      2024
## 3  Data Engineer     Senior-level       Full-time     On-site      2024
## 4  Data Engineer     Senior-level       Full-time     On-site      2024
## 5 MLOps Engineer     Senior-level       Full-time     On-site      2024
## 6 MLOps Engineer     Senior-level       Full-time     On-site      2024
##   employee_residence salary salary_currency salary_in_usd company_location
## 1      United States 140032             USD        140032    United States
## 2      United States 100022             USD        100022    United States
## 3      United States 204662             USD        204662    United States
## 4      United States 184662             USD        184662    United States
## 5      United States 175000             USD        175000    United States
## 6      United States 110000             USD        110000    United States
##   company_size
## 1       Medium
## 2       Medium
## 3       Medium
## 4       Medium
## 5       Medium
## 6       Medium
SE <- 70946.84/sqrt(4105)

Z <- (162071-145561)/SE

Z
## [1] 14.90976
# pnorm() comes from the stats package that ships with R, so no extra library is needed
p_value <- pnorm(Z, lower.tail = FALSE)
p_value
## [1] 1.423912e-50

Insight:

  • If \(p\text{-value} < \alpha\), then reject \(H_0\) and accept \(H_1\)
  • If \(p\text{-value} > \alpha\), then fail to reject \(H_0\)

Result:

  • alpha: 0.05
  • p-value: 1.423912e-50

alpha > p-value

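Conclusion: reject H0. The test statistic compares the senior-level mean salary (162,071 USD) with the mean of all workers (145,561 USD), so what it establishes is that senior-level salaries in 2020-2024 are significantly higher than the overall average. Confirming an increase over time would require comparing senior-level salaries between years.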
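
The calculation above can be wrapped in a small helper so the same steps are easy to repeat for another group. This is a sketch under stated assumptions: z_test_greater is a hypothetical function name, not part of any package, and it implements a one-sided, one-sample z-test with a known population standard deviation, using the figures from this section as input.

```r
# One-sided z-test: is the sample mean greater than the population mean?
z_test_greater <- function(sample_mean, pop_mean, pop_sd, n) {
  se <- pop_sd / sqrt(n)               # standard error of the mean
  z  <- (sample_mean - pop_mean) / se  # standardized difference
  p  <- pnorm(z, lower.tail = FALSE)   # upper-tail probability
  list(se = se, z = z, p_value = p)
}

# Figures taken from the analysis in this section
result <- z_test_greater(sample_mean = 162071, pop_mean = 145561,
                         pop_sd = 70946.84, n = 4105)

result$z               # about 14.91
result$p_value < 0.05  # TRUE, so H0 is rejected at alpha = 0.05
```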