Analyzing the Evolution of Data Science Salaries from 2020 to 2023

Hello,

The “Data Science Salaries” dataset from Kaggle provides valuable insights into the compensation trends and variations in the field of data science from 2020 to 2023. This dataset encompasses a comprehensive collection of salary information from various industries, organizations, and geographic regions, enabling data professionals, researchers, and organizations to analyze and understand the prevailing salary landscape in the data science domain during this four-year period.


By examining this dataset, one can gain a deeper understanding of the factors influencing data science salaries, such as job roles, experience levels, educational backgrounds, and geographical locations. The dataset serves as a valuable resource for individuals seeking career guidance, companies aiming to benchmark their compensation strategies, and researchers investigating the evolving dynamics of the data science job market.

For any comments, please contact:

Duncan Kabiito Matovu,
Mobile +256787755590; Email:

Assessing the data set for outliers

There are indeed some extreme values, which we shall not remove at this point. Instead, let us dive in and interrogate them further: which companies are hiring (different companies pay differently), what is the level of expertise, what is the job title, and so on.
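One common way to flag extreme values without removing them is the 1.5 x IQR rule. The sketch below is an illustration only, not necessarily the method used for this analysis; the function name `iqr_outliers` and the sample salaries are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up salaries in USD, for illustration only (not taken from the dataset).
sample_salaries = [60_000, 75_000, 80_000, 85_000, 90_000, 95_000, 110_000, 450_000]
flagged = iqr_outliers(sample_salaries)
print(flagged)
```

Flagged rows can then be inspected by company, expertise level and job title before deciding whether they are errors or genuine high earners.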

Average salary (yearly) paid to data professionals by country
Location                     Salary in USD (average)
Israel                       217,332
Puerto Rico                  167,500
United States                158,462
Saudi Arabia                 134,999
Canada                       134,550
New Zealand                  125,000
Australia                    122,134
Bosnia and Herzegovina       120,000
Russian Federation           119,500
Ireland                      115,188
Japan                        110,822
United Kingdom               108,425
Switzerland                  101,659
Algeria                      100,000
China                        100,000
Iran, Islamic Republic of    100,000
Iraq                         100,000
United Arab Emirates         100,000
Sweden                       98,791
Mexico                       94,865
Lithuania                    94,812
Germany                      92,568
Norway                       88,462
Kenya                        80,000
France                       78,390
Belgium                      76,865
Croatia                      76,726
Netherlands                  75,470
Ukraine                      72,667
Austria                      71,355
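A ranking like the one above comes from grouping the rows by country and averaging the salary within each group. The sketch below shows one way to do that with the Python standard library. The function name `mean_salary_by_country` and the records are hypothetical, and the row count is reported next to each mean because an average built on one or two rows says little about a country.

```python
from collections import defaultdict

def mean_salary_by_country(records):
    """Return (country, mean salary, row count) tuples, highest mean first."""
    totals = defaultdict(lambda: [0.0, 0])  # country -> [salary sum, row count]
    for country, salary in records:
        totals[country][0] += salary
        totals[country][1] += 1
    summary = [(c, s / n, n) for c, (s, n) in totals.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)

# Made-up (country, salary in USD) rows, for illustration only.
records = [("A", 100_000), ("A", 120_000), ("B", 90_000),
           ("C", 150_000), ("B", 70_000)]
ranking = mean_salary_by_country(records)
for country, mean, count in ranking:
    print(f"{country}: {mean:,.0f} USD (n={count})")
```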

Israel and Puerto Rico

Level of expertise hired by Israel and Puerto Rico

United States and Canada

I was quite amazed by the level of expertise and the involvement in data work in the United States and Canada.

Both countries seem to have a versatile mix of experts, but most of the focus should go to the number of people working in roles ranging from Data Engineer to Machine Learning Engineer.

Below are the top 10 jobs in which people seem to be most involved (United States on the left, Canada on the right)

Of course this is only a sample dataset, but it does paint a picture.
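A top-10 list of job titles per country amounts to a frequency count. The sketch below shows the idea with `collections.Counter`; the function name `top_job_titles` and the rows are hypothetical and only illustrate the counting step.

```python
from collections import Counter

def top_job_titles(rows, country, n=10):
    """Return the n most frequent job titles for one country as (title, count)."""
    counts = Counter(title for c, title in rows if c == country)
    return counts.most_common(n)

# Made-up (country, job title) rows, for illustration only.
rows = [
    ("United States", "Data Engineer"),
    ("United States", "Data Engineer"),
    ("United States", "Data Engineer"),
    ("United States", "Data Scientist"),
    ("United States", "Data Scientist"),
    ("United States", "Machine Learning Engineer"),
    ("Canada", "Data Scientist"),
    ("Canada", "Data Scientist"),
    ("Canada", "Data Engineer"),
]
us_top = top_job_titles(rows, "United States", n=2)
print(us_top)
```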

Data Analysts

Salaries of data analysts have been increasing over the years, but the increase is quite marginal.

The level of expertise is also of note: experts seem to be earning more than even directors. In my context, a director is more senior than an expert.

Data scientists

It’s clear that data science is taking over. There was a slight decline in data scientists’ salaries in 2021, but we see them picking up in 2022 to nearly double what they were in 2021.

Expertise is also of note, but here we see that directors currently earn much more than experts.

It will be good to interrogate the skills required for one to be a director in the data science space.

Linear regression

Exploring the Factors Affecting Salaries for data analysts, scientists and modellers: A Multiple Linear Regression Analysis

Model 1 (response: salary in USD; predictors: expertise level, year and job title)

## 
## Call:
## lm(formula = salary_in_usd ~ factor(expertise_level) + factor(year) + 
##     factor(job_title), data = new_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128944  -33973   -4646   31100  353898 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             122806      15180   8.090 1.53e-15 ***
## factor(expertise_level)Expert           -18694      10752  -1.739 0.082363 .  
## factor(expertise_level)Intermediate     -63759      11017  -5.787 9.26e-09 ***
## factor(expertise_level)Junior           -88202      11739  -7.514 1.17e-13 ***
## factor(year)2021                         -4337      12054  -0.360 0.719054    
## factor(year)2022                         18022      10560   1.707 0.088178 .  
## factor(year)2023                         33188      10502   3.160 0.001620 ** 
## factor(job_title)Data Modeler            -4813      23525  -0.205 0.837924    
## factor(job_title)Data Modeller           -9183      37132  -0.247 0.804722    
## factor(job_title)Data Science Lead       60136      16740   3.592 0.000342 ***
## factor(job_title)Data Scientist          30512       3230   9.446  < 2e-16 ***
## factor(job_title)Data Specialist          4931      12244   0.403 0.687257    
## factor(job_title)Head of Data Science    47809      18179   2.630 0.008659 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52220 on 1128 degrees of freedom
## Multiple R-squared:  0.319,  Adjusted R-squared:  0.3118 
## F-statistic: 44.04 on 12 and 1128 DF,  p-value: < 2.2e-16

Asterisks mark statistical significance; *** denotes p < 0.001 (see the "Signif. codes" line in the output).

The adjusted R-squared (0.3118) indicates that model 1 explains about 31% of the variation in salaries, after adjusting for the number of predictors used.
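The adjusted R-squared can be recovered from the other numbers in the Model 1 summary using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below is only a check on the printed values; the function name is made up for the example, and n = 1141 is inferred from the 1128 residual degrees of freedom plus 12 predictors plus the intercept.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the Model 1 summary: R-squared 0.319, 12 predictors,
# 1128 residual degrees of freedom, so n = 1128 + 12 + 1 = 1141.
adj = adjusted_r_squared(0.319, 1141, 12)
print(round(adj, 4))
```

The result agrees with the reported 0.3118 up to the rounding of the printed R-squared.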

Model 1 diagnostics

Model 2

## 
## Call:
## lm(formula = salary_in_usd ~ factor(expertise_level) + factor(job_title) + 
##     factor(company_size), data = new_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -126437  -34682   -4508   29097  344208 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           132879.0    12090.6  10.990  < 2e-16 ***
## factor(expertise_level)Expert         -19796.8    10767.1  -1.839 0.066232 .  
## factor(expertise_level)Intermediate   -66768.3    11009.3  -6.065 1.80e-09 ***
## factor(expertise_level)Junior         -87624.2    11800.0  -7.426 2.20e-13 ***
## factor(job_title)Data Modeler           -481.2    23537.3  -0.020 0.983694    
## factor(job_title)Data Modeller         -3707.0    37169.5  -0.100 0.920574    
## factor(job_title)Data Science Lead     63646.8    16763.6   3.797 0.000154 ***
## factor(job_title)Data Scientist        30530.3     3237.4   9.430  < 2e-16 ***
## factor(job_title)Data Specialist        5200.7    12276.2   0.424 0.671911    
## factor(job_title)Head of Data Science  46989.3    18184.9   2.584 0.009892 ** 
## factor(company_size)Medium             20648.4     4961.0   4.162 3.39e-05 ***
## factor(company_size)Small             -21923.5     9018.6  -2.431 0.015216 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52310 on 1129 degrees of freedom
## Multiple R-squared:  0.3161, Adjusted R-squared:  0.3094 
## F-statistic: 47.44 on 11 and 1129 DF,  p-value: < 2.2e-16

Asterisks mark statistical significance; *** denotes p < 0.001 (see the "Signif. codes" line in the output).

The adjusted R-squared (0.3094) indicates that model 2 explains about 31% of the variation in salaries, after adjusting for the number of predictors used.
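Because every predictor in model 2 is a factor, a fitted salary is simply the intercept plus the coefficient of each non-reference level that applies. The sketch below copies the estimates from the model 2 output. The helper `predict_salary` is hypothetical, and the reference levels are an assumption: they are presumably Director, Data Analyst and Large, inferred only from the fact that these levels do not appear in the coefficient table.

```python
# Estimates copied from the model 2 coefficient table (USD).
MODEL_2 = {
    "(Intercept)": 132879.0,
    "Expert": -19796.8,
    "Intermediate": -66768.3,
    "Junior": -87624.2,
    "Data Modeler": -481.2,
    "Data Modeller": -3707.0,
    "Data Science Lead": 63646.8,
    "Data Scientist": 30530.3,
    "Data Specialist": 5200.7,
    "Head of Data Science": 46989.3,
    "Medium": 20648.4,
    "Small": -21923.5,
}

def predict_salary(coefs, *levels):
    """Intercept plus the coefficient of each listed level.

    A level with no coefficient contributes 0, i.e. it is treated as a
    reference level (assumption: misspelled levels are not detected).
    """
    return coefs["(Intercept)"] + sum(coefs.get(level, 0.0) for level in levels)

junior_ds_medium = predict_salary(MODEL_2, "Junior", "Data Scientist", "Medium")
print(round(junior_ds_medium, 1))
```

Calling `predict_salary(MODEL_2)` with no levels returns the intercept, which is the fitted salary for the reference combination.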

Comparing the two models to get the most appropriate model

Null hypothesis: model 1 and model 2 fit the data equally well

Alternative hypothesis: the two models differ significantly in fit

## Likelihood ratio test
## 
## Model 1: salary_in_usd ~ factor(expertise_level) + factor(year) + factor(job_title)
## Model 2: salary_in_usd ~ factor(expertise_level) + factor(job_title) + 
##     factor(company_size)
##   #Df LogLik Df  Chisq Pr(>Chisq)  
## 1  14 -14008                       
## 2  13 -14010 -1 4.9047    0.02678 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

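With a p-value of 0.02678, which is less than 0.05 (the 5% significance level), we reject the null hypothesis that the two models fit equally well. Model 1 has the higher log-likelihood (-14008 versus -14010) and the higher adjusted R-squared (0.3118 versus 0.3094), so the difference favors model 1 rather than model 2. Because the two models are not nested (model 1 includes year, model 2 includes company size), the test should be read as indicative only.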
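The p-value in the likelihood ratio output can be reproduced by hand. The test statistic is twice the difference in log-likelihoods and is compared with a chi-square distribution with 1 degree of freedom, whose upper tail has the closed form erfc(sqrt(x / 2)). The sketch below is a check under that assumption; the function name is made up for the example.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Chisq as printed in the test output. The printed log-likelihoods (-14008 and
# -14010) are rounded, so 2 * (difference) only approximates this value.
chisq = 4.9047
p_value = chi2_sf_df1(chisq)
print(round(p_value, 4))
```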

Model 2 diagnostics

Note: Interpretation of model 2 coefficients can be provided on request

Feel free to reach out to me for any consultancy work. Thanks.