The aim of this study is to find the most relevant factors that influence the salaries of Data Scientists. In this paper we use association rules to discover frequent patterns of occurrence.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.
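For a rule $X \Rightarrow Y$ (antecedent $X$, consequent $Y$), the standard interestingness measures used throughout this analysis are defined as follows (standard definitions, stated here for reference):

$$\mathrm{supp}(X \Rightarrow Y) = P(X \cup Y), \qquad \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}$$

Support is the share of transactions containing all items of the rule, confidence estimates the conditional probability $P(Y \mid X)$, and a lift above 1 means $X$ and $Y$ co-occur more often than expected under independence.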
The analysis is based on a dataset from Kaggle: https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary?select=ds_salaries.csv. Initially the dataset contains 11 columns and 3755 records. The variables include work_year, experience_level, employment_type, job_title and a few others, all of which are described in the table below:
| Variable | Explanation |
|---|---|
| work_year | The year the salary was paid (2020-2023) |
| experience_level | EN - Entry-level / Junior, MI - Mid-level / Intermediate, SE - Senior-level / Expert, EX - Director |
| employment_type | PT - Part-time, FT - Full-time, CT - Contract, FL - Freelance |
| job_title | The role worked in during the year |
| salary | The total gross salary amount paid |
| salary_currency | The currency of the salary paid as an ISO 4217 currency code |
| salary_in_usd | The salary in USD |
| employee_residence | Employee’s primary country of residence during the work year as an ISO 3166 country code |
| remote_ratio | The overall amount of work done remotely |
| company_location | The country of the employer’s main office or contracting branch |
| company_size | The median number of people that worked for the company during the year, coded S (small), M (medium) or L (large) |
One challenge in interpreting the dataset, which this article does not address, lies in the variability of job titles. While there is a wide array of names, certain roles may involve essentially similar tasks despite having different titles. This diversity in nomenclature adds complexity to the analysis, requiring a nuanced understanding of the tasks and responsibilities behind each title. Consolidating similar titles with data mining techniques will be important in further analysis.
# Libraries
library(ggplot2)
library(dplyr)
library(tidyverse)
library(arules)
library(arulesViz)
# Read the dataset
salaries <- read.csv("C:\\Users\\User\\Desktop\\ds_salaries.csv")
# We do not need the columns salary and salary_currency, since salary_in_usd already contains the salary converted to USD
salaries <- salaries[, -c(5,6)]
# Summarize the data
summary(salaries)
## work_year experience_level employment_type job_title
## Min. :2020 Length:3755 Length:3755 Length:3755
## 1st Qu.:2022 Class :character Class :character Class :character
## Median :2022 Mode :character Mode :character Mode :character
## Mean :2022
## 3rd Qu.:2023
## Max. :2023
## salary_in_usd employee_residence remote_ratio company_location
## Min. : 5132 Length:3755 Min. : 0.00 Length:3755
## 1st Qu.: 95000 Class :character 1st Qu.: 0.00 Class :character
## Median :135000 Mode :character Median : 0.00 Mode :character
## Mean :137570 Mean : 46.27
## 3rd Qu.:175000 3rd Qu.:100.00
## Max. :450000 Max. :100.00
## company_size
## Length:3755
## Class :character
## Mode :character
##
##
##
# Are there any missing values?
salaries[!complete.cases(salaries),]
## [1] work_year experience_level employment_type job_title
## [5] salary_in_usd employee_residence remote_ratio company_location
## [9] company_size
## <0 rows> (or 0-length row.names)
It is always important to check whether the data contain missing values. In our case there are none, so we can proceed with exploratory data analysis.
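A per-column tally gives the same answer at a glance (a one-line base R check):

# Count missing values per column; all zeros confirm a complete dataset
colSums(is.na(salaries))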
In the realm of data-related professions, the predominant role is that of a Data Engineer, with a count of 1040 occurrences. Following closely are Data Scientist and Data Analyst, with 840 and 612 instances, respectively. The fourth and fifth most frequent roles are Machine Learning Engineer and Analytics Engineer, with 289 and 103 occurrences.
Despite the dominance of these top roles, it’s noteworthy that the dataset encompasses a rich diversity of job titles, totaling 93 distinct names among its 3755 observations. The spectrum ranges from foundational titles like Data Scientist and Data Engineer to more specialized roles such as Staff Data Scientist or Head of Machine Learning.
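These counts are easy to verify directly from the data (a minimal check):

# Top 5 most frequent job titles and the total number of distinct titles
sort(table(salaries$job_title), decreasing = TRUE)[1:5]
length(unique(salaries$job_title))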
barplot(sort(table(salaries$job_title), decreasing = T)[1:5], col = "tomato",
main = "Top 5 frequent work titles",
ylab = "Number of people",
xlab = "Work title",
ylim = c(0, 1200))
It is important to point out that we are not following the same individuals over time; the observations are independent cross-sections, so year-over-year comparisons should be made with caution.
In the year 2020, the mean salary stands at $92,303, with a median salary of $73,065. Moving into 2021, we observe an increase in both metrics, with the mean salary climbing to $94,087 and the median salary reaching $80,000. The year 2022 unveils a notable surge in compensation, as the mean salary experiences a growth to $133,339, accompanied by a median salary of $131,300. Continuing this upward trajectory, the dataset reflects a further rise in salaries for 2023. The mean salary reaches $149,046, while the median salary registers at $143,860. This continued increase underscores a trend of escalating compensation levels within the dataset.
It’s noteworthy that the divergence between mean and median salaries offers insight into the distribution’s shape: a widening gap between the two measures suggests high outliers or a right-skewed distribution, while close alignment suggests a more symmetric one. The visualization confirms this: as expected, the distributions for 2020-2021 are more right-skewed than those for 2022-2023.
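This reading can be quantified; the sketch below computes sample skewness per year, assuming the moments package is installed (any sample-skewness implementation would do):

# Sample skewness of salary by work_year; larger positive values
# indicate a heavier right tail
library(moments)
salaries %>%
group_by(work_year) %>%
summarize(skewness = skewness(salary_in_usd))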
The salary distribution exhibits a striking similarity between 2022 and 2023, suggesting a stable trend during this period. However, a noticeable distinction emerges when comparing 2021 with 2022-2023: the latter period shows a markedly higher mean salary. The distribution for 2020 also stands out, containing more outliers than 2021-2023, and its mean salary is significantly lower than in the subsequent years.
# Mean and median for salary via work_year
salaries %>%
group_by(work_year) %>%
summarize(mean_salary = mean(salary_in_usd),
median_salary = median(salary_in_usd))
## # A tibble: 4 × 3
## work_year mean_salary median_salary
## <int> <dbl> <dbl>
## 1 2020 92303. 73065
## 2 2021 94087. 80000
## 3 2022 133339. 131300
## 4 2023 149046. 143860
# Distribution of salary in usd via work_year
ggplot(salaries, aes(x = salary_in_usd, y = after_stat(density))) +
geom_histogram(color = "white", fill = "tomato") +
geom_density(color = "black", linewidth = 1) +
facet_grid(~work_year) +
ggtitle("Distribution of salary via work_year") +
xlab("Salary [USD]") +
ylab("Density") +
theme(plot.title = element_text(hjust = 0.5))
Examining the salary distributions across different company sizes reveals clear differences. Notably, the mean and median salaries are highest for medium-sized (M) companies, while small-sized (S) companies exhibit the lowest figures. It’s important to mention that all distributions are skewed to the right, indicating a long tail of relatively high salaries.
Furthermore, the distributions for medium-sized (M) and small-sized (S) companies display a higher level of leptokurtosis compared to large-sized (L) companies. This means that salaries within M and S companies are more tightly clustered around the average, with a sharper peak in the distribution. On the other hand, salaries in L companies are more spread out.
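The kurtosis claim can be checked the same way (a sketch, again assuming the moments package; values above 3 indicate a leptokurtic, i.e. sharply peaked, distribution):

# Sample kurtosis of salary by company_size
library(moments)
salaries %>%
group_by(company_size) %>%
summarize(kurtosis = kurtosis(salary_in_usd))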
# Mean and median for salary via company size
salaries %>%
group_by(company_size) %>%
summarize(mean_salary = mean(salary_in_usd),
median_salary = median(salary_in_usd))
## # A tibble: 3 × 3
## company_size mean_salary median_salary
## <chr> <dbl> <dbl>
## 1 L 118301. 108500
## 2 M 143131. 140000
## 3 S 78227. 62146
# Salary distribution via company size
ggplot(salaries, aes(x = salary_in_usd, y = after_stat(density))) +
geom_histogram(color = "white", fill = "tomato") +
geom_density(color = "black", linewidth = 1) +
facet_grid(~company_size) +
ggtitle("Distribution of salary via company_size") +
xlab("Salary [USD]") +
ylab("Density") +
theme(plot.title = element_text(hjust = 0.5))
The highest-paying position in the dataset is the role of a Data Science Tech Lead, commanding an impressive average salary of $375,000. Following closely is the Cloud Data Architect position, with a substantial mean salary of $250,000. The third and fourth spots are occupied by the roles of Data Lead and Data Analytics Lead, offering mean salaries of $212,500 and $211,254.5, respectively.
Notably, the fifth position in the top 5 is held by the Principal Data Scientist role, with an average salary of $198,171.1. It’s worth highlighting that this is little more than half the salary of the top-paying position.
# Finding 5 work titles with highest salary
highest_salary <- as.data.frame(head(salaries %>%
group_by(job_title) %>%
summarize(mean_salary = mean(salary_in_usd)) %>%
arrange(desc(mean_salary)), 5))
# Barplot top 5 salaries
barplot(height = highest_salary$mean_salary, names.arg = highest_salary$job_title,
col = "tomato", main = "Top 5 Work titles with highest salary",
xlab = "Work title", ylab = "Salary level in USD", ylim = c(0,400000))
In this step we prepare the data in transactional form, as required by the association rules algorithm.
We perform the following steps:
- derive a categorical work_type variable from remote_ratio
- discretize salary_in_usd into a categorical salary variable
- remove the variables work_year, salary_in_usd, employee_residence and remote_ratio
# Change work_type variable to categorical
salaries$work_type <- as.factor(ifelse(salaries$remote_ratio == 50, "Hybrid_work",
ifelse(salaries$remote_ratio == 100, "Remote_work", "Onsite_work")))
# Changing salary variable to categorical
salary_levels <- quantile(salaries$salary_in_usd, c(0.25, 0.5, 0.75))
salaries$salary <- as.factor(ifelse(salaries$salary_in_usd < salary_levels[1], "Low_salary",
ifelse(salaries$salary_in_usd < salary_levels[2], "Medium_salary",
ifelse(salaries$salary_in_usd < salary_levels[3], "Good_salary", "High_salary"))))
# Removing Unnecessary columns
salaries <- salaries[,-c(1,5,6,7)]
# Export prepared data to salaries_transactions.csv which then will be read as transactions
#write.csv(salaries, "C:\\Users\\User\\Desktop\\salaries_transactions.csv", row.names = F)
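As an aside, the nested ifelse() chain above can be written more idiomatically with base R’s cut(); the equivalent sketch below is shown commented out because salary_in_usd has already been dropped at this point (right = FALSE mirrors the strict < comparisons):

# Equivalent discretization with cut(); intervals are [break, next break)
# salaries$salary <- cut(salaries$salary_in_usd,
#                        breaks = c(-Inf, salary_levels, Inf),
#                        labels = c("Low_salary", "Medium_salary", "Good_salary", "High_salary"),
#                        right = FALSE)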
The algorithm we use for mining association rules is Apriori. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database.
The density of the sparse item matrix is about 3.7%, i.e. the share of non-zero entries in the matrix. Moving on to support, the most frequent items are FT, M, US, SE and Onsite_work, with 3718, 3153, 3040, 2518 and 1923 occurrences respectively. It is worth pointing out that the transactions are homogeneous in size: each contains exactly 7 items.
# Read the data
salaries_to_rules <- read.transactions("C:\\Users\\User\\Desktop\\salaries_transactions.csv", sep = ",")
# Remove the header row, which was read in as the first transaction
salaries_to_rules <- salaries_to_rules[-1,]
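# Alternatively, the header row can be dropped at read time instead of afterwards
# (assuming a version of arules whose read.transactions accepts the skip argument):
# salaries_to_rules <- read.transactions("C:\\Users\\User\\Desktop\\salaries_transactions.csv",
#                                        sep = ",", skip = 1)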
# Summarize the data
summary(salaries_to_rules)
## transactions as itemMatrix in sparse format with
## 3755 rows (elements/itemsets/transactions) and
## 188 columns (items) and a density of 0.03723404
##
## most frequent items:
## FT M US SE Onsite_work (Other)
## 3718 3153 3040 2518 1923 11933
##
## element (itemset/transaction) length distribution:
## sizes
## 7
## 3755
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 7 7 7 7 7
##
## includes extended item information - examples:
## labels
## 1 3D Computer Vision Researcher
## 2 AE
## 3 AI Developer
From the visualization we can see that the item occurring most frequently in the dataset is FT (full-time employment type), followed by M (medium company size), US (company location: United States), SE (senior experience level) and Onsite_work.
# Visualization of top 10 items that occur most frequently in the dataset
itemFrequencyPlot(salaries_to_rules, topN = 10, col = "tomato", ylim = c(0,1))
The pattern seen in the image plot of 100 random transactions suggests that the data contain some intrinsic association rules, as a few items recur throughout the transactions.
image(sample(salaries_to_rules, 100))
The Apriori algorithm identifies meaningful associations between items by exploiting the downward-closure property: if an item or itemset is infrequent, then any larger set containing it must also be infrequent. In essence, this lets the algorithm efficiently prune infrequent itemsets by leveraging the relationship between itemsets and their subsets.
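This pruning can be made concrete by mining frequent itemsets directly before turning to rules (a minimal sketch; the 20% support threshold is illustrative):

# Mine frequent itemsets only (no rules yet); every subset of a reported
# itemset is guaranteed to be at least as frequent as the itemset itself
frequent_items <- apriori(salaries_to_rules,
parameter = list(target = "frequent itemsets", support = 0.2))
inspect(head(sort(frequent_items, by = "support"), 5))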
To run the Apriori algorithm we must specify a minimum support and a minimum confidence. To make an informed judgement about appropriate levels, we first mine a broad rule set (support 10%, confidence 15%) and visualize its support, confidence and lift. Based on the plot, we settle on a minimum support of about 10% and a minimum confidence of about 30%.
salaries_rules <- apriori(salaries_to_rules, parameter = list(support = 0.1, confidence = 0.15, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 375
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[181 item(s), 3755 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [695 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
plot(salaries_rules)
We run the Apriori algorithm with minimum support 10%, minimum confidence 30% and a minimum rule length of 2. Since we want to find the most relevant factors influencing the salaries of Data Scientists, we fix “High_salary” as the consequent and search for antecedents that are followed by this salary level with high confidence.
# Apriori association rules algorithm
rules.highest_salary <- apriori(salaries_to_rules, parameter = list(support = 0.1,
confidence = 0.3, minlen = 2),
appearance = list(default = "lhs", rhs = "High_salary"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 375
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[181 item(s), 3755 transaction(s)] done [0.00s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Order rules by confidence
rules.highest_salary.byconf <- sort(rules.highest_salary, by = "confidence",
decreasing = T)
inspect(rules.highest_salary.byconf[1:10])
## lhs rhs support confidence coverage
## [1] {FT, SE, US} => {High_salary} 0.2109188 0.3505976 0.6015979
## [2] {SE, US} => {High_salary} 0.2109188 0.3501326 0.6023968
## [3] {FT, Remote_work, SE} => {High_salary} 0.1030626 0.3486486 0.2956059
## [4] {Remote_work, SE} => {High_salary} 0.1030626 0.3477089 0.2964048
## [5] {FT, M, SE, US} => {High_salary} 0.1914780 0.3448441 0.5552597
## [6] {M, SE, US} => {High_salary} 0.1914780 0.3445137 0.5557923
## [7] {Onsite_work, SE, US} => {High_salary} 0.1118509 0.3378922 0.3310253
## [8] {FT, Onsite_work, SE, US} => {High_salary} 0.1118509 0.3378922 0.3310253
## [9] {FT, M, Remote_work, US} => {High_salary} 0.1019973 0.3333333 0.3059920
## [10] {M, Remote_work, US} => {High_salary} 0.1019973 0.3321769 0.3070573
## lift count
## [1] 1.339261 792
## [2] 1.337485 792
## [3] 1.331817 387
## [4] 1.328227 387
## [5] 1.317284 719
## [6] 1.316021 719
## [7] 1.290728 420
## [8] 1.290728 420
## [9] 1.273313 383
## [10] 1.268896 383
The dataset indicates that the predominant company location is the United States, constituting over 80% of all locations, so it is not surprising that the US appears in almost every rule. About 30.7% of all observations (the coverage of rule [10]) concern people working remotely in a medium-sized US company; their likelihood of earning a high salary is about 33.2%. On the other hand, roughly 60% of observations concern people working full-time at the senior level in a US-based company; for them the probability of a high salary is about 35.1%, corresponding to 792 high earners. It is important to point out that this group has the highest confidence among all the rules found.
rules.highest_salary.bysupp <- sort(rules.highest_salary, by = "support",
decreasing = T)
inspect(rules.highest_salary.bysupp[1:10])
## lhs rhs support confidence coverage lift
## [1] {US} => {High_salary} 0.2500666 0.3088816 0.8095872 1.179909
## [2] {FT, US} => {High_salary} 0.2495340 0.3100596 0.8047936 1.184409
## [3] {M, US} => {High_salary} 0.2247670 0.3099523 0.7251664 1.183999
## [4] {FT, M, US} => {High_salary} 0.2247670 0.3104082 0.7241012 1.185741
## [5] {SE} => {High_salary} 0.2205060 0.3288324 0.6705726 1.256120
## [6] {FT, SE} => {High_salary} 0.2205060 0.3294867 0.6692410 1.258619
## [7] {SE, US} => {High_salary} 0.2109188 0.3501326 0.6023968 1.337485
## [8] {FT, SE, US} => {High_salary} 0.2109188 0.3505976 0.6015979 1.339261
## [9] {M, SE} => {High_salary} 0.1994674 0.3303926 0.6037284 1.262080
## [10] {FT, M, SE} => {High_salary} 0.1994674 0.3308304 0.6029294 1.263752
## count
## [1] 939
## [2] 937
## [3] 844
## [4] 844
## [5] 828
## [6] 828
## [7] 792
## [8] 792
## [9] 749
## [10] 749
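Confidence and support each tell only part of the story; sorting by lift surfaces the rules whose consequent occurs most disproportionately often given the antecedent (a quick check):

# Top 5 rules by lift, i.e. strongest association relative to the
# baseline frequency of High_salary
inspect(head(sort(rules.highest_salary, by = "lift"), 5))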
plot(rules.highest_salary, method = "graph")
plot(rules.highest_salary, method="paracoord", control=list(reorder=TRUE))
Association rules are effective in uncovering patterns within a dataset, particularly when examining salary trends for Data Scientists. The analysis indicates that the crucial antecedents contributing to a high probability of falling into the highest salary quartile are full-time employment and holding a senior-level position. These findings align with expectations and are intuitive in the context of salary considerations for Data Scientists.
However, it’s essential to acknowledge certain limitations of the analysis. The overwhelming majority of observations concern companies located in the United States, which limits the generalizability of the findings and leaves little insight into salary patterns in less represented locations.
A potential challenge for future analyses lies in the variability of job titles: given the wide array of names, certain roles may involve essentially similar tasks despite being labeled differently.