The aim of this study is to find the most relevant factors that influence the salaries of Data Scientists. In this paper we use association rules to discover frequent patterns of occurrence.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.
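For a rule $X \Rightarrow Y$ (antecedent $X$, consequent $Y$), the standard interestingness measures used throughout this analysis are defined as follows (standard definitions, stated here for reference):

$$\mathrm{supp}(X \Rightarrow Y) = P(X \cup Y), \qquad \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}$$

Support is the share of transactions containing all items of the rule, confidence estimates the conditional probability $P(Y \mid X)$, and a lift above 1 means $X$ and $Y$ co-occur more often than expected under independence.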
The analysis is based on a dataset from Kaggle: https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary?select=ds_salaries.csv. Initially the dataset contains 11 columns and 3755 records. The variables include work_year, experience_level, employment_type, job_title and a few others, all of which are described in the table below:
| Variable | Explanation |
|---|---|
| work_year | The year the salary was paid (2020-2023) |
| experience_level | EN - Entry-level / Junior, MI - Mid-level / Intermediate, SE - Senior-level / Expert, EX - Director |
| employment_type | PT - Part-time, FT - Full-time, CT - Contract, FL - Freelance |
| job_title | The role worked in during the year |
| salary | The total gross salary amount paid |
| salary_currency | The currency of the salary paid as an ISO 4217 currency code |
| salary_in_usd | The salary in USD |
| employee_residence | Employee’s primary country of residence during the work year as an ISO 3166 country code |
| remote_ratio | The overall amount of work done remotely |
| company_location | The country of the employer’s main office or contracting branch |
| company_size | The median number of people that worked for the company during the year, coded S (small), M (medium) or L (large) |
One challenge in interpreting the dataset, which this article does not address, lies in the variability of job titles. While there is a wide array of names, certain roles may involve essentially similar tasks despite having different titles. This diversity in nomenclature adds complexity to the analysis, requiring a nuanced understanding of the tasks and responsibilities behind each title. Consolidating similar titles with data mining techniques will be important in further analysis.
# Libraries
library(ggplot2)
library(dplyr)
library(tidyverse)
library(arules)
library(arulesViz)
# Read the dataset
salaries <- read.csv("C:\\Users\\User\\Desktop\\ds_salaries.csv")
# We do not need the columns salary and salary_currency, since salary_in_usd already contains the salary converted to USD
salaries <- salaries[, -c(5,6)]
# Summarize the data
summary(salaries)
## work_year experience_level employment_type job_title
## Min. :2020 Length:3755 Length:3755 Length:3755
## 1st Qu.:2022 Class :character Class :character Class :character
## Median :2022 Mode :character Mode :character Mode :character
## Mean :2022
## 3rd Qu.:2023
## Max. :2023
## salary_in_usd employee_residence remote_ratio company_location
## Min. : 5132 Length:3755 Min. : 0.00 Length:3755
## 1st Qu.: 95000 Class :character 1st Qu.: 0.00 Class :character
## Median :135000 Mode :character Median : 0.00 Mode :character
## Mean :137570 Mean : 46.27
## 3rd Qu.:175000 3rd Qu.:100.00
## Max. :450000 Max. :100.00
## company_size
## Length:3755
## Class :character
## Mode :character
##
##
##
# Are there any missing values?
salaries[!complete.cases(salaries),]
## [1] work_year experience_level employment_type job_title
## [5] salary_in_usd employee_residence remote_ratio company_location
## [9] company_size
## <0 rows> (or 0-length row.names)
It is always important to check whether the data contain missing values. In our case there are none, so we can proceed with exploratory data analysis.
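A per-column tally gives the same answer at a glance (a one-line base R check):

# Count missing values per column; all zeros confirm a complete dataset
colSums(is.na(salaries))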
In the realm of data-related professions, the predominant role is that of a Data Engineer, with a count of 1040 occurrences. Following closely are Data Scientist and Data Analyst, with 840 and 612 instances, respectively. The fourth and fifth most frequent roles are Machine Learning Engineer and Analytics Engineer, with 289 and 103 occurrences.
Despite the dominance of these top roles, it’s noteworthy that the dataset encompasses a rich diversity of job titles, totaling 93 distinct names among its 3755 observations. The spectrum ranges from foundational titles like Data Scientist and Data Engineer to more specialized roles such as Staff Data Scientist or Head of Machine Learning.
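These counts are easy to verify directly from the data (a minimal check):

# Top 5 most frequent job titles and the total number of distinct titles
sort(table(salaries$job_title), decreasing = TRUE)[1:5]
length(unique(salaries$job_title))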
barplot(sort(table(salaries$job_title), decreasing = T)[1:5], col = "tomato",
main = "Top 5 frequent work titles",
ylab = "Number of people",
xlab = "Work title",
ylim = c(0, 1200))
It is important to point out that we are not following the same individuals over time; the observations are independent cross-sections, so year-over-year comparisons should be made with caution.
In the year 2020, the mean salary stands at $92,303, with a median salary of $73,065. Moving into 2021, we observe an increase in both metrics, with the mean salary climbing to $94,087 and the median salary reaching $80,000. The year 2022 unveils a notable surge in compensation, as the mean salary experiences a growth to $133,339, accompanied by a median salary of $131,300. Continuing this upward trajectory, the dataset reflects a further rise in salaries for 2023. The mean salary reaches $149,046, while the median salary registers at $143,860. This continued increase underscores a trend of escalating compensation levels within the dataset.
It’s noteworthy that the divergence between mean and median salaries offers insight into the distribution’s shape: a widening gap between the two measures suggests high outliers or a right-skewed distribution, while close alignment suggests a more symmetric one. The visualization confirms this: as expected, the distributions for 2020-2021 are more right-skewed than those for 2022-2023.
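This reading can be quantified; the sketch below computes sample skewness per year, assuming the moments package is installed (any sample-skewness implementation would do):

# Sample skewness of salary by work_year; larger positive values
# indicate a heavier right tail
library(moments)
salaries %>%
group_by(work_year) %>%
summarize(skewness = skewness(salary_in_usd))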
The salary distribution exhibits a striking similarity between 2022 and 2023, suggesting a stable trend during this period. However, a noticeable distinction emerges when comparing 2021 with 2022-2023: the latter period shows a markedly higher mean salary. The distribution for 2020 also stands out, containing more outliers than 2021-2023, and its mean salary is significantly lower than in the subsequent years.
# Mean and median for salary via work_year
salaries %>%
group_by(work_year) %>%
summarize(mean_salary = mean(salary_in_usd),
median_salary = median(salary_in_usd))
## # A tibble: 4 × 3
## work_year mean_salary median_salary
## <int> <dbl> <dbl>
## 1 2020 92303. 73065
## 2 2021 94087. 80000
## 3 2022 133339. 131300
## 4 2023 149046. 143860
# Distribution of salary in usd via work_year
ggplot(salaries, aes(x = salary_in_usd, y = after_stat(density))) +
geom_histogram(color = "white", fill = "tomato") +
geom_density(color = "black", linewidth = 1) +
facet_grid(~work_year) +
ggtitle("Distribution of salary via work_year") +
xlab("Salary [USD]") +
ylab("Density") +
theme(plot.title = element_text(hjust = 0.5))
Examining the salary distributions across different company sizes reveals clear differences. Notably, the mean and median salaries are highest for medium-sized (M) companies, while small-sized (S) companies exhibit the lowest figures. It’s important to mention that all distributions are skewed to the right, indicating a long tail of relatively high salaries.
Furthermore, the distributions for medium-sized (M) and small-sized (S) companies display a higher level of leptokurtosis compared to large-sized (L) companies. This means that salaries within M and S companies are more tightly clustered around the average, with a sharper peak in the distribution. On the other hand, salaries in L companies are more spread out.
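The kurtosis claim can be checked the same way (a sketch, again assuming the moments package; values above 3 indicate a leptokurtic, i.e. sharply peaked, distribution):

# Sample kurtosis of salary by company_size
library(moments)
salaries %>%
group_by(company_size) %>%
summarize(kurtosis = kurtosis(salary_in_usd))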
# Mean and median for salary via company size
salaries %>%
group_by(company_size) %>%
summarize(mean_salary = mean(salary_in_usd),
median_salary = median(salary_in_usd))
## # A tibble: 3 × 3
## company_size mean_salary median_salary
## <chr> <dbl> <dbl>
## 1 L 118301. 108500
## 2 M 143131. 140000
## 3 S 78227. 62146
# Salary distribution via company size
ggplot(salaries, aes(x = salary_in_usd, y = after_stat(density))) +
geom_histogram(color = "white", fill = "tomato") +
geom_density(color = "black", linewidth = 1) +
facet_grid(~company_size) +
ggtitle("Distribution of salary via company_size") +
xlab("Salary [USD]") +
ylab("Density") +
theme(plot.title = element_text(hjust = 0.5))
The highest-paying position in the dataset is the role of a Data Science Tech Lead, commanding an impressive average salary of $375,000. Following closely is the Cloud Data Architect position, with a substantial mean salary of $250,000. The third and fourth spots are occupied by the roles of Data Lead and Data Analytics Lead, offering mean salaries of $212,500 and $211,254.5, respectively.
Notably, the fifth position in the top 5 is held by the Principal Data Scientist role, with an average salary of $198,171.1. It’s worth highlighting that this is little more than half the salary of the top-paying position.
# Finding 5 work titles with highest salary
highest_salary <- as.data.frame(head(salaries %>%
group_by(job_title) %>%
summarize(mean_salary = mean(salary_in_usd)) %>%
arrange(desc(mean_salary)), 5))
# Barplot top 5 salaries
barplot(height = highest_salary$mean_salary, names.arg = highest_salary$job_title,
col = "tomato", main = "Top 5 Work titles with highest salary",
xlab = "Work title", ylab = "Salary level in USD", ylim = c(0,400000))
In this step we prepare the data in transactional form, as required by the association rules algorithm.
We perform the following steps:
- derive a categorical work_type variable from remote_ratio
- discretize salary_in_usd into a categorical salary variable
- remove the variables work_year, salary_in_usd, employee_residence and remote_ratio
# Change work_type variable to categorical
salaries$work_type <- as.factor(ifelse(salaries$remote_ratio == 50, "Hybrid_work",
ifelse(salaries$remote_ratio == 100, "Remote_work", "Onsite_work")))
# Changing salary variable to categorical
salary_levels <- quantile(salaries$salary_in_usd, c(0.25, 0.5, 0.75))
salaries$salary <- as.factor(ifelse(salaries$salary_in_usd < salary_levels[1], "Low_salary",
ifelse(salaries$salary_in_usd < salary_levels[2], "Medium_salary",
ifelse(salaries$salary_in_usd < salary_levels[3], "Good_salary", "High_salary"))))
# Removing Unnecessary columns
salaries <- salaries[,-c(1,5,6,7)]
# Export prepared data to salaries_transactions.csv which then will be read as transactions
#write.csv(salaries, "C:\\Users\\User\\Desktop\\salaries_transactions.csv", row.names = F)
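As an aside, the nested ifelse() chain above can be written more idiomatically with base R’s cut(); the equivalent sketch below is shown commented out because salary_in_usd has already been dropped at this point (right = FALSE mirrors the strict < comparisons):

# Equivalent discretization with cut(); intervals are [break, next break)
# salaries$salary <- cut(salaries$salary_in_usd,
#                        breaks = c(-Inf, salary_levels, Inf),
#                        labels = c("Low_salary", "Medium_salary", "Good_salary", "High_salary"),
#                        right = FALSE)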
The algorithm we use for mining association rules is Apriori. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database.
The density of the sparse item matrix is about 3.7%, i.e. the share of non-zero entries in the matrix. Moving on to support, the most frequent items are FT, M, US, SE and Onsite_work, with 3718, 3153, 3040, 2518 and 1923 occurrences respectively. It is worth pointing out that the transactions are homogeneous in size: each contains exactly 7 items.
# Read the data
salaries_to_rules <- read.transactions("C:\\Users\\User\\Desktop\\salaries_transactions.csv", sep = ",")
# Remove the header row, which was read in as the first transaction
salaries_to_rules <- salaries_to_rules[-1,]
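# Alternatively, the header row can be dropped at read time instead of afterwards
# (assuming a version of arules whose read.transactions accepts the skip argument):
# salaries_to_rules <- read.transactions("C:\\Users\\User\\Desktop\\salaries_transactions.csv",
#                                        sep = ",", skip = 1)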
# Summarize the data
summary(salaries_to_rules)
## transactions as itemMatrix in sparse format with
## 3755 rows (elements/itemsets/transactions) and
## 188 columns (items) and a density of 0.03723404
##
## most frequent items:
## FT M US SE Onsite_work (Other)
## 3718 3153 3040 2518 1923 11933
##
## element (itemset/transaction) length distribution:
## sizes
## 7
## 3755
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 7 7 7 7 7
##
## includes extended item information - examples:
## labels
## 1 3D Computer Vision Researcher
## 2 AE
## 3 AI Developer
From the visualization we can see that the item occurring most frequently in the dataset is FT (full-time employment type), followed by M (medium company size), US (company location: United States), SE (senior experience level) and Onsite_work.
# Visualization of top 10 items that occur most frequently in the dataset
itemFrequencyPlot(salaries_to_rules, topN = 10, col = "tomato", ylim = c(0,1))
The pattern seen in the image plot of 100 random transactions suggests that the data contain some intrinsic association rules, as a few items recur throughout the transactions.
image(sample(salaries_to_rules, 100))
The Apriori algorithm identifies meaningful associations between items by exploiting the downward-closure property: if an item or itemset is infrequent, then any larger set containing it must also be infrequent. In essence, this lets the algorithm efficiently prune infrequent itemsets by leveraging the relationship between itemsets and their subsets.
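This pruning can be made concrete by mining frequent itemsets directly before turning to rules (a minimal sketch; the 20% support threshold is illustrative):

# Mine frequent itemsets only (no rules yet); every subset of a reported
# itemset is guaranteed to be at least as frequent as the itemset itself
frequent_items <- apriori(salaries_to_rules,
parameter = list(target = "frequent itemsets", support = 0.2))
inspect(head(sort(frequent_items, by = "support"), 5))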
To run the Apriori algorithm we must specify a minimum support and a minimum confidence. To make an informed judgement about appropriate levels, we first mine a broad rule set (support 10%, confidence 15%) and visualize its support, confidence and lift. Based on the plot, we settle on a minimum support of about 10% and a minimum confidence of about 30%.
salaries_rules <- apriori(salaries_to_rules, parameter = list(support = 0.1, confidence = 0.15, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 375
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[181 item(s), 3755 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [695 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
plot(salaries_rules)
We run the Apriori algorithm with minimum support 10%, minimum confidence 30% and a minimum rule length of 2. Since we want to find the most relevant factors influencing the salaries of Data Scientists, we fix “High_salary” as the consequent and search for antecedents that are followed by this salary level with high confidence.
# Apriori association rules algorithm
rules.highest_salary <- apriori(salaries_to_rules, parameter = list(support = 0.1,
confidence = 0.3, minlen = 2),
appearance = list(default = "lhs", rhs = "High_salary"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 375
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[181 item(s), 3755 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Order rules by confidence
rules.highest_salary.byconf <- sort(rules.highest_salary, by = "confidence",
decreasing = T)
inspect(rules.highest_salary.byconf[1:10])
## lhs rhs support confidence coverage
## [1] {FT, SE, US} => {High_salary} 0.2109188 0.3505976 0.6015979
## [2] {SE, US} => {High_salary} 0.2109188 0.3501326 0.6023968
## [3] {FT, Remote_work, SE} => {High_salary} 0.1030626 0.3486486 0.2956059
## [4] {Remote_work, SE} => {High_salary} 0.1030626 0.3477089 0.2964048
## [5] {FT, M, SE, US} => {High_salary} 0.1914780 0.3448441 0.5552597
## [6] {M, SE, US} => {High_salary} 0.1914780 0.3445137 0.5557923
## [7] {Onsite_work, SE, US} => {High_salary} 0.1118509 0.3378922 0.3310253
## [8] {FT, Onsite_work, SE, US} => {High_salary} 0.1118509 0.3378922 0.3310253
## [9] {FT, M, Remote_work, US} => {High_salary} 0.1019973 0.3333333 0.3059920
## [10] {M, Remote_work, US} => {High_salary} 0.1019973 0.3321769 0.3070573
## lift count
## [1] 1.339261 792
## [2] 1.337485 792
## [3] 1.331817 387
## [4] 1.328227 387
## [5] 1.317284 719
## [6] 1.316021 719
## [7] 1.290728 420
## [8] 1.290728 420
## [9] 1.273313 383
## [10] 1.268896 383
The dataset indicates that the predominant company location is the United States, constituting over 80% of all locations, so it is not surprising that the US appears in almost every rule. About 30.7% of all observations (the coverage of rule [10]) concern people working remotely in a medium-sized US company; their likelihood of earning a high salary is about 33.2%. On the other hand, roughly 60% of observations concern people working full-time at the senior level in a US-based company; for them the probability of a high salary is about 35.1%, corresponding to 792 high earners. It is important to point out that this group has the highest confidence among all the rules found.
rules.highest_salary.bysupp <- sort(rules.highest_salary, by = "support",
decreasing = T)
inspect(rules.highest_salary.bysupp[1:10])
## lhs rhs support confidence coverage lift
## [1] {US} => {High_salary} 0.2500666 0.3088816 0.8095872 1.179909
## [2] {FT, US} => {High_salary} 0.2495340 0.3100596 0.8047936 1.184409
## [3] {M, US} => {High_salary} 0.2247670 0.3099523 0.7251664 1.183999
## [4] {FT, M, US} => {High_salary} 0.2247670 0.3104082 0.7241012 1.185741
## [5] {SE} => {High_salary} 0.2205060 0.3288324 0.6705726 1.256120
## [6] {FT, SE} => {High_salary} 0.2205060 0.3294867 0.6692410 1.258619
## [7] {SE, US} => {High_salary} 0.2109188 0.3501326 0.6023968 1.337485
## [8] {FT, SE, US} => {High_salary} 0.2109188 0.3505976 0.6015979 1.339261
## [9] {M, SE} => {High_salary} 0.1994674 0.3303926 0.6037284 1.262080
## [10] {FT, M, SE} => {High_salary} 0.1994674 0.3308304 0.6029294 1.263752
## count
## [1] 939
## [2] 937
## [3] 844
## [4] 844
## [5] 828
## [6] 828
## [7] 792
## [8] 792
## [9] 749
## [10] 749
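Confidence and support each tell only part of the story; sorting by lift surfaces the rules whose consequent occurs most disproportionately often given the antecedent (a quick check):

# Top 5 rules by lift, i.e. strongest association relative to the
# baseline frequency of High_salary
inspect(head(sort(rules.highest_salary, by = "lift"), 5))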
plot(rules.highest_salary, method = "graph")
plot(rules.highest_salary, method="paracoord", control=list(reorder=TRUE))
Association rules are effective in uncovering patterns within a dataset, particularly when examining salary trends for Data Scientists. The analysis indicates that the crucial antecedents contributing to a high probability of falling into the highest salary quartile are full-time employment and holding a senior-level position. These findings align with expectations and are intuitive in the context of salary considerations for Data Scientists.
However, it’s essential to acknowledge certain limitations of the analysis. The overwhelming majority of observations concern companies located in the United States, which limits the generalizability of the findings and leaves little insight into salary patterns in less represented locations.
A potential challenge for future analyses lies in the variability of job titles: given the wide array of names, certain roles may involve essentially similar tasks despite being labeled differently.