Predictive Modeling of Data Scientist Salary

In this report I will attempt to predict salary for a Data Scientist on a dataset of salary scraped from glassdoor pertaining to analytics roles. Based on the factors included in this model I will attempt to assess the most important factors towards getting a higher salary in a Data Scientist Role, as well as building a model that accurately predicts salary. I will perform data cleaning operations on the dataset, create models, and assess their performance in relation to one another.

General Notes

I will be evaluating predictive performance across all models in terms of RMSE (Root Mean Square Error) in order to get a measure, in original units of $ (thousands), that describes the predicted vs. actual salary.

I will also be ignoring titles in this data set that are not strictly related to the “Data Scientist” role. The accuracy of the model experiences a 100% increase when the dataset is filtered to describe only data scientist roles and there is a lot of variation in the skills expected of a Data Scientist vs. ML Engineer vs. Data Engineer vs. Data Analyst. The model is most pertinent and the findings are most relevant when focused on Data Scientist information directly, therefore, that will be my approach.

Packages

#pkgs <- c("car", "magrittr", "earth", "tidyverse", "lubridate", "glue", "vip", "stringr", "vip", "caret", "kableExtra", "rattle", "dplyr", "ggplot2", "doParallel", "foreach", "rpart", "ipred", "janitor", "ranger", "h20")
#lib <- installed.packages()[, "Package"]
#install.packages(setdiff(pkgs, lib))
library(car)
library(magrittr)
library(earth)
library(tidyverse)
library(lubridate)
library(glue)
library(stringr)
library(vip)
library(caret)
library(kableExtra)
library(rattle)
library(dplyr)       # for data wrangling
library(ggplot2)     # for awesome plotting
library(doParallel)  # for parallel backend to foreach
library(foreach)     # for parallel processing with for loops
library(rpart)       # for fitting decision trees
library(ipred)       # for fitting bagged decision trees
library(janitor)
library(ranger)   # a c++ implementation of random forest 
library(h2o)      # a java-based implementation of random forest

Data Manipulation

For data minipulation - the main tasks were to cast the categorical data types within the dataset as factors to make them useful later on. I’ve also created data subsets tailored to the tree models and MARS models that I’ll be developing later on. Scaling the salary of a record based on location (state & city, if available) cost of living index data is very important to accounting for legitimate differences in the value of a job offer, beyond just the expected salary increase in more expensive locations.

Data Summarization

data.frame(table(df$job_title_sim)) %>%
  kbl(caption = "Frequency by Simplified Job Title for All Jobs") %>%
  kable_styling(full_width = F)  %>% 
  kable_paper("hover", full_width = F) %>% 
  kable_classic(full_width = F, html_font = "Cambria")

Frequency by Simplified Job Title for All Jobs
Var1	Freq
analyst	101
data analitics	8
data engineer	119
data modeler	5
data scientist	313
Data scientist project manager	16
director	5
machine learning engineer	22
other scientist	143

data.frame(table(data_scientist$seniority_by_title)) %>%
  kbl(caption = "Frequency by Level of Seniority for Data Scientists") %>%
  kable_styling(full_width = F)  %>% 
  kable_paper("hover", full_width = F) %>% 
  kable_classic(full_width = F, html_font = "Cambria")

Frequency by Level of Seniority for Data Scientists
Var1	Freq
jr	1
na	200
sr	92

summary(data_scientist) %>%
  kbl(caption = "Summary of Data Scientist Salary Data") %>%
  kable_styling(full_width = F) %>%
  kable_paper("hover", full_width = F) %>%
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  scroll_box(width = "1000px", height = "500px")

Summary of Data Scientist Salary Data
Job Title	Rating	Location	Size	Type of ownership	Industry	Revenue	Competitors	Hourly	Employer provided	company_txt	State	Age	Python	spark	aws	excel	sql	sas	keras	pytorch	scikit	tensor	hadoop	tableau	bi	flink	mongo	google_an	job_title_sim	seniority_by_title	Degree	Skillset	scaled_salary
Length:293	Min. :-1.000	Length:293	Length:293	Length:293	Length:293	Length:293	Length:293	0:293	0:290	Length:293	Length:293	Min. : -1.00	0: 54	0:209	0:216	0:140	0:118	0:243	0:264	0:260	0:246	0:234	0:233	0:217	0:265	0:289	0:274	0:290	Length:293	Length:293	Length:293	Length:293	Min. : 15.97
Class :character	1st Qu.: 3.500	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	1: 0	1: 3	Class :character	Class :character	1st Qu.: 12.00	1:239	1: 84	1: 77	1:153	1:175	1: 50	1: 29	1: 33	1: 47	1: 59	1: 60	1: 76	1: 28	1: 4	1: 19	1: 3	Class :character	Class :character	Class :character	Class :character	1st Qu.: 83.94
Mode :character	Median : 3.800	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	NA	NA	Mode :character	Mode :character	Median : 26.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mode :character	Mode :character	Mode :character	Mode :character	Median : 98.97
NA	Mean : 3.762	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean : 48.87	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean :103.49
NA	3rd Qu.: 4.100	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.: 63.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.:120.60
NA	Max. : 5.000	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :277.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :253.75

# create a metric that identifies whether a salary was above or below scaled salary average for the dataset
head(data_scientist_tree) %>%
  kbl(caption = "Summary of Tree Based Data Subset") %>%
  kable_styling(full_width = F)  %>% 
  kable_paper("hover", full_width = F) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  scroll_box(width = "1000px", height = "500px")

Summary of Tree Based Data Subset
Job Title	Rating	Location	Size	Type of ownership	Industry	Sector	Revenue	Competitors	company_txt	State	Age	Python	spark	aws	excel	sql	sas	tableau	bi	mongo	job_title_sim	seniority_by_title	Degree	Skillset	scaled_salary
Data Scientist	3.8	Albuquerque, NM	501 - 1000	Company - Private	Aerospace & Defense	Aerospace & Defense	$50 to $100 million (USD)	none	Tecolote Research	NM	48	1	0	0	1	0	1	1	1	0	data scientist	na	M	Python,excel,sas,tableau,bi	77.50269
Healthcare Data Scientist	3.4	Linthicum, MD	10000+	Other Organization	Health Care Services & Hospitals	Health Care	$2 to $5 billion (USD)	none	University of Maryland Medical System	MD	37	1	0	0	0	0	0	0	0	0	data scientist	na	M	Python	86.53085
Data Scientist	4.8	Clearwater, FL	501 - 1000	Company - Private	Security Services	Business Services	$100 to $500 million (USD)	none	KnowBe4	FL	11	1	1	0	1	1	1	0	0	0	data scientist	na	M	Python,spark,excel,sql,sas	83.93901
Data Scientist	3.8	Richland, WA	1001 - 5000	Government	Energy	Oil, Gas, Energy & Utilities	$500 million to $1 billion (USD)	Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory	PNNL	WA	56	1	0	0	0	0	0	0	0	0	data scientist	na	na	Python	71.47961
Data Scientist	2.9	New York, NY	51 - 200	Company - Private	Advertising & Marketing	Business Services	Unknown / Non-Applicable	Commerce Signals, Cardlytics, Yodlee	Affinity Solutions	NY	23	1	0	0	1	1	1	0	0	0	data scientist	na	na	Python,excel,sql,sas	89.45312
Data Scientist	3.4	Dallas, TX	201 - 500	Company - Public	Real Estate	Real Estate	$1 to $2 billion (USD)	Digital Realty, CoreSite, Equinix	CyrusOne	TX	21	1	0	1	1	1	0	0	1	1	data scientist	na	na	Python,aws,excel,sql,bi,mongo	96.44670

summary(data_scientist_tree) %>%
  kbl(caption = "Summary of Tree Based Data Subset") %>%
  kable_styling(full_width = F)  %>% 
  kable_paper("hover", full_width = F) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  scroll_box(width = "1000px", height = "500px")

Summary of Tree Based Data Subset
Job Title	Rating	Location	Size	Type of ownership	Industry	Sector	Revenue	Competitors	Hourly	Employer provided	company_txt	State	Age	Python	spark	aws	excel	sql	sas	keras	pytorch	scikit	tensor	hadoop	tableau	bi	flink	mongo	google_an	job_title_sim	seniority_by_title	Degree	Skillset	scaled_salary
Length:296	Min. :-1.000	Length:296	Length:296	Length:296	Length:296	Length:296	Length:296	Length:296	0:296	0:291	Length:296	Length:296	Min. : -1.00	0: 56	0:212	0:217	0:141	0:120	0:246	0:267	0:263	0:249	0:236	0:236	0:220	0:268	0:292	0:277	0:293	Length:296	Length:296	Length:296	Length:296	Min. : 15.97
Class :character	1st Qu.: 3.500	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	1: 0	1: 5	Class :character	Class :character	1st Qu.: 11.00	1:240	1: 84	1: 79	1:155	1:176	1: 50	1: 29	1: 33	1: 47	1: 60	1: 60	1: 76	1: 28	1: 4	1: 19	1: 3	Class :character	Class :character	Class :character	Class :character	1st Qu.: 84.06
Mode :character	Median : 3.800	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	NA	NA	Mode :character	Mode :character	Median : 26.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mode :character	Mode :character	Mode :character	Mode :character	Median : 98.97
NA	Mean : 3.754	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean : 48.36	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean :103.60
NA	3rd Qu.: 4.100	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.: 63.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.:120.87
NA	Max. : 5.000	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :277.00	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :253.75

# Created a couple of trees, there are some crazy job titles that are messing with this model, I'm going to remove Job title and stick with the job_title_simplified

As the focus of our model is on determining the salary of a data scientist, I have created a subset of the total dataset that strictly looks at the rows of the dataset that have the simplified job tite of “data scientist”. This data set includes 293 variables.

Tree Based Modeling Approach

Full & Pruned Trees

Due to the high amount of categorical data, I’ll start with an analysis based on tree-methods, as they will be most flexible with handling categorical data types with many components. I’m using an “anova” method to create a Regression tree, rather than a Classification tree.

library(rpart.plot)
salary_dt <- rpart(
  formula = scaled_salary ~ .,
  data    = data_scientist_tree,
  method  = "anova"
)
rpart.plot(salary_dt)

The above tree plots some default decision nodes for modeling salary among data scientists - at this point the tree model is not very interpretable, and is, in all likelihood, overfit. We can use this as a gut check to examine our assumptions about any relavent factors to the dataset.

plotcp(salary_dt)

The above plot displays the complexity parameter (depth) of the tree vs. the X-Val relative error of the model. As we increase the complexity parameter, the model becomes more accurate, ax expected. It appears that we receive only marginal gains at a complexity parameter of 6, which may be a good cutoff point for pruning the full tree later on.

salary_dt2 <- rpart(
    formula = scaled_salary ~ .,
    data    = data_scientist_tree,
    method  = "anova", 
    control = list(cp = 0, xval = 10)
)

plotcp(salary_dt2)
abline(v = 11, lty = "dashed")

I created the tree model again witha broader range of complexity parameters, there appears to be convergence around a complexity parameter of 6 in terms of accuracy gained by additions to the model.

vip(salary_dt, num_features = 40, bar = FALSE)

Above is a Variable Importance Plot, which displays which factors are most important to how the model is currently calculating expected scaled salary value.

A lot of the findings above match some basic intuitions about our data - where you work matters, both in terms of who you’re working for (your company) but also your physical location. Even after scaling for salary, where you work is important, certain locations will have better paying jobs, even after scaling for cost of living. What skills you bring also appear important to determining overall salary. Competitors, Industry, Sector also appear important - we’ll see if they continue to be across the other models that will be created.

train_index <- sample(c(1:dim(data_scientist_tree)[1]), dim(data_scientist_tree)[1]*0.6)
train_df <- data_scientist_tree[train_index, ]
valid.df <- data_scientist_tree[-train_index, ]

default_tree <- rpart(scaled_salary ~ ., data = data_scientist_tree, 
                    control = rpart.control(maxdepth = 30), method = "anova")

cv.ct <- rpart(scaled_salary ~ ., data = train_df, method = "anova", 
               cp = 0.00001, minsplit = 5, xval = 5)
# use printcp() to print the table
printcp(cv.ct)

## 
## Regression tree:
## rpart(formula = scaled_salary ~ ., data = train_df, method = "anova", 
##     cp = 1e-05, minsplit = 5, xval = 5)
## 
## Variables actually used in tree construction:
## [1] company_txt Industry    Job Title   Location    mongo       Rating     
## [7] Size        Skillset   
## 
## Root node error: 168193/177 = 950.24
## 
## n= 177 
## 
##            CP nsplit rel error  xerror    xstd
## 1  5.2538e-01      0  1.000000 1.00269 0.16844
## 2  1.5202e-01      1  0.474622 0.83219 0.14247
## 3  9.0022e-02      2  0.322603 0.87409 0.15408
## 4  4.0734e-02      3  0.232581 0.80394 0.12124
## 5  3.2209e-02      4  0.191847 0.82608 0.12197
## 6  3.2132e-02      5  0.159639 0.82398 0.12334
## 7  1.9600e-02      6  0.127506 0.84766 0.12924
## 8  1.6752e-02      7  0.107907 0.84255 0.12941
## 9  1.1981e-02      8  0.091155 0.85437 0.12873
## 10 1.0202e-02      9  0.079174 0.81701 0.12345
## 11 6.6105e-03     10  0.068972 0.81454 0.12358
## 12 5.6117e-03     11  0.062362 0.79088 0.11961
## 13 5.5206e-03     12  0.056750 0.79427 0.12029
## 14 5.1270e-03     13  0.051229 0.79427 0.12029
## 15 4.7964e-03     14  0.046102 0.78902 0.12004
## 16 3.4944e-03     15  0.041306 0.78337 0.12009
## 17 2.8147e-03     16  0.037811 0.77966 0.11978
## 18 2.4474e-03     17  0.034997 0.77194 0.11865
## 19 2.2700e-03     18  0.032549 0.78074 0.12244
## 20 1.9409e-03     19  0.030279 0.78133 0.12253
## 21 1.5321e-03     20  0.028338 0.78316 0.12219
## 22 1.5227e-03     21  0.026806 0.78611 0.12236
## 23 1.3888e-03     22  0.025284 0.78581 0.12237
## 24 1.3318e-03     23  0.023895 0.78581 0.12237
## 25 1.2118e-03     24  0.022563 0.78531 0.12239
## 26 4.7831e-04     25  0.021351 0.79570 0.12390
## 27 4.4868e-04     26  0.020873 0.80085 0.12456
## 28 3.9100e-04     27  0.020424 0.79854 0.12446
## 29 2.3451e-04     28  0.020033 0.80065 0.12470
## 30 2.2698e-04     29  0.019799 0.80305 0.12469
## 31 1.9145e-04     30  0.019572 0.80305 0.12469
## 32 1.7595e-04     31  0.019380 0.80426 0.12472
## 33 1.7382e-04     32  0.019204 0.80383 0.12471
## 34 1.6125e-04     33  0.019031 0.80256 0.12471
## 35 1.5940e-04     34  0.018869 0.80152 0.12462
## 36 1.5269e-04     35  0.018710 0.80034 0.12462
## 37 1.4256e-04     36  0.018557 0.80001 0.12463
## 38 1.2596e-04     37  0.018415 0.79910 0.12459
## 39 1.0232e-04     38  0.018289 0.80036 0.12470
## 40 7.7590e-05     39  0.018186 0.80022 0.12488
## 41 7.4497e-05     40  0.018109 0.79960 0.12486
## 42 7.4212e-05     41  0.018034 0.80010 0.12488
## 43 6.4402e-05     42  0.017960 0.79994 0.12489
## 44 4.7736e-05     43  0.017896 0.80329 0.12550
## 45 4.2362e-05     44  0.017848 0.80364 0.12568
## 46 3.3233e-05     45  0.017806 0.80440 0.12570
## 47 3.0659e-05     46  0.017772 0.80357 0.12571
## 48 3.0292e-05     47  0.017742 0.80332 0.12571
## 49 2.3732e-05     48  0.017711 0.80326 0.12571
## 50 1.3405e-05     49  0.017688 0.80274 0.12571
## 51 1.2380e-05     50  0.017674 0.80402 0.12611
## 52 1.0477e-05     51  0.017662 0.80357 0.12610
## 53 1.0000e-05     52  0.017651 0.80343 0.12610

Above is the code for creating an optimal pruned tree for regression based ont he XError value comparing tree models. The XError refers to the error on cross validated data for the dataset (or PRESS Residuals), which is more relevant to the goal of predicting (on unseen data) what the salary of a data scientist will be.

pruned.ct <- prune(cv.ct, 
                   cp = cv.ct$cptable[which.min(cv.ct$cptable[, "xerror"]), "CP"])
length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])

## [1] 18

prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

## Warning in abbreviate(names, minlen): abbreviate used with non-ASCII chars

The pruning process occurring above tunes the model based on the XError value and the complexity paramter, creating the optimal tree that maintains predictive accuracy while remaining sparse, and therefore more generalizable to new data. The pruned tree in this instance has a length of 5 decisions at most.

vip(pruned.ct, num_features = 15, bar = FALSE)

plotcp(pruned.ct)
abline(v = 11, lty = "dashed")

The most important variables in this pruned tree ramain the same as in our default tree - these variables will likely continue to be pertinent to the models that are yet to be created.

Random Forests

I will now create a Random Forest to draw on the decisions made by many different trees to create more accurate and more stable predictions in the aggregate.

n_features <- length(setdiff(names(train_df), "scaled_salary"))

# train a default random forest model
salary_rf <- ranger(
  scaled_salary ~ ., 
  data = janitor::clean_names(data_scientist_tree),
  mtry = floor(n_features / 3),
  respect.unordered.factors = "order",
  seed = 123
)

(default_rmse <- sqrt(salary_rf$prediction.error))

## [1] 7.99949

Above I have created a default random forest model (default meaning non-tuned) with a RMSE of 7.99, indicating that the average discrepancy between the current prediction of salary and the actual salary of a model is around 8k.

hyper_grid <- expand.grid(
  mtry = floor(n_features * c(.05, .15, .25, .333, .4)),
  min.node.size = c(1, 3, 5, 10), 
  replace = c(TRUE, FALSE),                               
  sample.fraction = c(.5, .63, .8),                       
  rmse = NA                                               
)

# execute full cartesian grid search
for(i in seq_len(nrow(hyper_grid))) {
  # fit model for ith hyperparameter combination
  fit <- ranger(
    formula         = scaled_salary ~ ., 
    data            = janitor::clean_names(data_scientist_tree), 
    num.trees       = n_features * 100,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$min.node.size[i],
    replace         = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    verbose         = FALSE,
    seed            = 123,
    respect.unordered.factors = 'order',
  )
  # export OOB error 
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}

The above code creates a tuning grid that will assess random forest models along a variety of parameters, including node size, number of features included, and fraction of sample used in the model. I will then take the results of each of these forest models and compare the RMSE of the top 10 models.

# assess top 10 models
hyper_grid %>%
  arrange(rmse) %>%
  mutate(perc_gain = (default_rmse - rmse) / default_rmse * 100) %>%
  head(10)

##    mtry min.node.size replace sample.fraction     rmse  perc_gain
## 1    11             1   FALSE            0.80 7.054532 11.8127250
## 2    13             1   FALSE            0.80 7.059665 11.7485596
## 3    13             3   FALSE            0.80 7.362932  7.9574822
## 4    11             3   FALSE            0.80 7.400790  7.4842223
## 5     8             1   FALSE            0.80 7.489880  6.3705207
## 6    13             1   FALSE            0.63 7.533248  5.8283927
## 7    11             1   FALSE            0.63 7.602401  4.9639193
## 8    13             1    TRUE            0.80 7.823469  2.2004017
## 9     8             3   FALSE            0.80 7.858942  1.7569583
## 10   13             3   FALSE            0.63 7.987150  0.1542496

The highest performing models include a sample fraction of 0.8, a minimum node length of 1, and have similar RMSE values of 7.06 & 7.07, respectively. These models both made significant gains on our default tree RMSE value, increasing accuracy by ~ 11.5%.

rf_impurity <- ranger(
  formula = scaled_salary ~ ., 
  data = janitor::clean_names(train_df), 
  num.trees = 2000,
  mtry = 32,
  min.node.size = 1,
  sample.fraction = .80,
  replace = FALSE,
  importance = "impurity",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)

rf_permutation <- ranger(
  formula = scaled_salary ~ ., 
  data = janitor::clean_names(train_df), 
  num.trees = 2000,
  mtry = 32,
  min.node.size = 1,
  sample.fraction = .80,
  replace = FALSE,
  importance = "permutation",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)

Measuring model performance by impurity and permutation. Impurity refers to the average total reduction in the loss function across all trees for a given feature while permutation refers to out-of-bag predictive performance of a tree.

p1 <- vip::vip(rf_impurity, num_features = 25, bar = FALSE)
p2 <- vip::vip(rf_permutation, num_features = 25, bar = FALSE)

gridExtra::grid.arrange(p1, p2, nrow = 1)

The above plots measure variable importance in the tuned random forest models that performed best according to impurity and permutation.

These metrics both prioritize the same variables, but place a much heavier emphasis on Company Name than our prior tree models did.

MARS Modeling Approach

Mars models (or Multi-Adaptive Regression Splines) are a flexible and highly informative approach to solving both regression and classification problems, I’d highly encourage reading Brad Boehmke & Brandon Greenwell’s writing on MARS Models which details how to use and interpret MARS regression models. These models will perform variable selection, identify variable transformations to satisfy regression based statistical assumptions, and will present variable importance as well as clearly defined coefficients for each important variable in the model summary. This helps MARS models to walk a great line between perceived “Black Box” models with high predictive accuracy like ensemble methods and neural nets vs. more interpretable but typically less accurate tree & linear regression models.

data_scientist <- data_scientist[, -c(30)]
cv_mars_model <- earth(data_scientist$scaled_salary ~ .,
                    data = data_scientist, 
                    degree = 1,
                    pmethod = "cv",
                    nfold = 5)

summary(cv_mars_model)

## Call: earth(formula=data_scientist$scaled_salary~., data=data_scientist,
##             pmethod="cv", degree=1, nfold=5)
## 
##                                                                              coefficients
## (Intercept)                                                                    131.134870
## Job TitleAssociate, Data Science, Internal Audit                               -51.145337
## Job TitleClinical Data Scientist                                               -21.450940
## Job TitleData Science Manager                                                   43.905577
## Job TitleData Scientist - Research                                             -41.707254
## Job TitleData Scientist - Sales                                                 29.779312
## Job TitleData Scientist - Systems Engineering                                  -32.745319
## Job TitleDigital Health Data Scientist                                         -23.314629
## Job TitleDirector Data Science                                                  71.275828
## Job TitleDirector II, Data Science - GRS Predictive Analytics                  -59.440559
## Job TitleDirector, Data Science                                                108.845264
## Job TitleManager of Data Science                                               -32.059236
## Job TitleProduct Engineer â\200“ Spatial Data Science and Statistical Analysis    -38.486901
## Job TitleSenior Data Scientist Artificial Intelligence                         -22.068017
## Job TitleSr Expert Data Science, Advanced Visual Analytics (Associate level)   -23.806495
## Job TitleStaff Data Scientist - Technology                                      59.431864
## LocationBellevue, WA                                                            67.466195
## LocationChicago, IL                                                             19.846020
## LocationClearwater, FL                                                         -20.952917
## LocationEmeryville, CA                                                          37.578606
## LocationIndianapolis, IN                                                        17.471286
## LocationLaurel, MD                                                              22.350350
## LocationPort Washington, NY                                                    -31.358925
## LocationSaint Louis, MO                                                         73.665647
## LocationSan Francisco, CA                                                      -30.414295
## LocationSanta Barbara, CA                                                      -34.693284
## LocationSouth San Francisco, CA                                                 39.774503
## LocationWoodbridge, NJ                                                         -22.070326
## IndustryAdvertising & Marketing                                                 -7.113428
## IndustryConsulting                                                             -13.837371
## IndustryConsumer Product Rental                                                -33.620095
## IndustryEnergy                                                                 -20.978740
## IndustryEnterprise Software & Network Solutions                                  9.952381
## IndustryInsurance Agencies & Brokerages                                          9.875236
## IndustryInvestment Banking & Asset Management                                   14.070621
## IndustryIT Services                                                            -13.649725
## IndustryLending                                                                 18.498132
## IndustryStaffing & Outsourcing                                                  27.650121
## IndustryTravel Agencies                                                        -34.150563
## IndustryTV Broadcast & Cable Networks                                           13.580927
## Revenue$25 to $50 million (USD)                                                -12.243140
## CompetitorsAdvisory Board, Booz Allen Hamilton, McKinsey & Company              96.396253
## CompetitorsEngagio, Bombora, Terminus                                           26.643077
## CompetitorsHarvard Business School, Coursera, edX                              -40.940866
## CompetitorsHealthfirst (New York), naviHealth                                  -30.291353
## CompetitorsLeidos, CACI International, Booz Allen Hamilton                      24.827920
## Employer provided1                                                              41.972664
## company_txtEntefy                                                              -29.944609
## company_txth2o.ai                                                               14.865402
## company_txtNorthrop Grumman                                                    -13.356712
## company_txtPinnacol Assurance                                                  -17.610871
## company_txtTechProjects                                                         20.779574
## StateMD                                                                         16.561085
## StateNM                                                                        -16.068242
## seniority_by_titlena                                                           -25.039880
## Skillsetexcel                                                                  -45.867807
## SkillsetPython                                                                  -7.109618
## SkillsetPython,aws,excel,sql,hadoop,tableau,bi                                  25.337724
## SkillsetPython,aws,sql                                                         -23.982416
## SkillsetPython,excel,sql                                                        11.740501
## SkillsetPython,sas                                                             138.255743
## SkillsetPython,spark,excel,sql,hadoop                                          -24.572663
## SkillsetPython,spark,sql                                                       -11.155593
## SkillsetPython,sql,tableau,bi                                                   11.944402
## Skillsetsql,sas                                                                -12.689846
## h(Rating-4.1)                                                                  -46.023423
## h(4.3-Rating)                                                                   -0.236265
## h(Rating-4.3)                                                                   68.071363
## h(Age-10)                                                                       -1.821629
## h(18-Age)                                                                       -0.171531
## h(Age-18)                                                                        1.869528
## 
## Selected 71 of 71 terms, and 66 of 623 predictors (pmethod="cv")
## Termination condition: RSq changed by less than 0.001 at 71 terms
## Importance: SkillsetPython,sas, seniority_by_titlena, ...
## Number of terms at each degree of interaction: 1 70 (additive model)
## GRSq 0.8948737  RSq 0.9715139  mean.oof.RSq 0.7311483 (sd 0.0738)
## 
## pmethod="backward" would have selected:
##     69 terms 66 preds,  GRSq 0.8999176  RSq 0.9714345  mean.oof.RSq 0.7233419

vip(
  cv_mars_model,
  num_features= 20
)

Above is a summary of the initial MARS regression model assessed on 5-fold cross validation for Data Scientists. The skills that a person brings, their location, their job title, and their industry are all important, as in our earlier models. What is most striking is the specificity of factors - specific locations, skills, and titles are mentioned by this model. Just another reason to love MARS models!

sqrt(mean(cv_mars_model$residuals^2))

## [1] 5.394519

This model has a high R- squared value on the data available and a low RMSE - making it a strong competitor to the Random Forest models generated in the previous section. This means that it describes the variation in the output variable remarkably well, however, we could be excluding important factors like interactions between variables and the performance on predicted values that are unseen.

There appear to be several skill-based factors included in this model. Let’s assess the importance of interactions in the model.

In order to assess the performance of the model more honestly, I’m using cross-validation to measure the accuracy of the predictions created by the model on hidden data. In this case, we are still retaining a high level of accuracy on unseen data with our predictive model in the first degree. In this instance we have an R-squared value of 0.971.

cv_interactions_model_2 <- earth(data_scientist$scaled_salary ~ .,
                    data = data_scientist, 
                    degree = 2,
                    pmethod = "cv",
                    nfold = 5)

cv_interactions_model_3 <- earth(data_scientist$scaled_salary ~ .,
                    data = data_scientist, 
                    degree = 3,
                    pmethod = "cv",
                    nfold = 5)
cv_interactions_model_4 <- earth(data_scientist$scaled_salary ~ .,
                    data = data_scientist, 
                    degree = 4,
                    pmethod = "cv",
                    nfold = 5)

Above I’ve created models that measure 2nd - 4th degree interactions between variables in the dataset, meaning assessing all the combinations of variables and their impact on the outcome.

summary(cv_interactions_model_2)

## Call: earth(formula=data_scientist$scaled_salary~., data=data_scientist,
##             pmethod="cv", degree=2, nfold=5)
## 
##                                                                                              coefficients
## (Intercept)                                                                                    113.967503
## Job TitleAssociate, Data Science, Internal Audit                                               -63.058408
## Job TitleData Science Analyst                                                                  -20.676057
## Job TitleData Science Manager                                                                   31.229062
## Job TitleData Scientist - Systems Engineering                                                  -35.852348
## Job TitleData Scientist Manager                                                                 28.894477
## Job TitleDirector Data Science                                                                  53.624485
## Job TitleDirector II, Data Science - GRS Predictive Analytics                                  -59.440559
## Job TitleDirector, Data Science                                                                 97.059009
## Job TitleProduct Engineer â\200“ Spatial Data Science and Statistical Analysis                    -27.471253
## Job TitleSenior Data Scientist / Machine Learning                                              -31.718409
## Job TitleStaff Data Scientist - Technology                                                      56.270262
## LocationAustin, TX                                                                             -24.193928
## LocationBellevue, WA                                                                            64.557659
## LocationChicago, IL                                                                             12.586667
## LocationPort Washington, NY                                                                    -26.330875
## LocationSan Francisco, CA                                                                      -18.993611
## LocationSanta Barbara, CA                                                                      -33.543545
## LocationSouth San Francisco, CA                                                                 26.531919
## IndustryConsumer Product Rental                                                                -41.317384
## IndustryEnergy                                                                                 -25.925817
## IndustryInvestment Banking & Asset Management                                                   19.215968
## IndustryIT Services                                                                             -8.744719
## IndustryLending                                                                                 23.754380
## IndustryStaffing & Outsourcing                                                                  25.719183
## IndustryTravel Agencies                                                                        -27.512585
## Revenue$25 to $50 million (USD)                                                                -15.269445
## CompetitorsAdvisory Board, Booz Allen Hamilton, McKinsey & Company                              81.362275
## CompetitorsFoursquare                                                                          -14.028748
## CompetitorsHarvard Business School, Coursera, edX                                              -44.770006
## CompetitorsInfosys, EPAM, Accenture                                                            -22.013124
## CompetitorsLeidos, CACI International, Booz Allen Hamilton                                      28.942478
## company_txtEntefy                                                                              -25.021386
## company_txth2o.ai                                                                               18.461167
## StateMD                                                                                         19.334517
## StateNM                                                                                        -15.480351
## aws1                                                                                             7.884330
## seniority_by_titlena                                                                           -21.031857
## SkillsetPython,aws,sql                                                                         -26.978607
## SkillsetPython,excel,sql,sas                                                                     9.978542
## SkillsetPython,sas                                                                             143.254745
## h(4.3-Rating)                                                                                    1.954686
## h(21-Age)                                                                                        1.110074
## h(Age-21)                                                                                        0.034256
## Job TitleHealthcare Data Scientist * StateMD                                                   -28.046619
## Rating * SkillsetPython,excel,sql                                                                6.173303
## LocationSaint Louis, MO * seniority_by_titlena                                                  62.095465
## LocationSan Francisco, CA * sql1                                                                -8.277100
## Size201 - 500 * seniority_by_titlena                                                            -9.341278
## IndustryComputer Hardware & Software * seniority_by_titlena                                     -6.757121
## IndustryConsulting * seniority_by_titlena                                                       -8.179259
## company_txtPfizer * seniority_by_titlena                                                       -30.134287
## seniority_by_titlena * Skillsetexcel                                                           -48.732212
## Job TitleData Scientist * h(Rating-4.3)                                                         30.687854
## Job TitleData Scientist - Research * h(4.3-Rating)                                             -36.137412
## Job TitleData Scientist - Sales * h(Rating-4.3)                                                136.381288
## Job TitleSenior Data Scientist Artificial Intelligence * h(Age-21)                              -0.428596
## Job TitleSr Expert Data Science, Advanced Visual Analytics (Associate level) * h(4.3-Rating)   -44.218736
## h(4.3-Rating) * LocationEmeryville, CA                                                          23.767832
## h(Rating-4.3) * Size501 - 1000                                                                 -28.615774
## h(Rating-4.3) * IndustryConsulting                                                             -94.332387
## h(4.3-Rating) * IndustryTV Broadcast & Cable Networks                                           23.400374
## h(4.3-Rating) * Revenue$10 to $25 million (USD)                                                -19.339757
## h(4.3-Rating) * company_txtNorthrop Grumman                                                    -27.766407
## h(4.3-Rating) * StateNY                                                                         -9.615383
## h(4.3-Rating) * spark1                                                                          -8.572931
## Size501 - 1000 * h(Age-21)                                                                      -0.231818
## StateFL * h(21-Age)                                                                             -1.402785
## h(21-Age) * sql1                                                                                -0.801701
## 
## Selected 69 of 73 terms, and 64 of 623 predictors (pmethod="cv")
## Termination condition: RSq changed by less than 0.001 at 73 terms
## Importance: SkillsetPython,sas, seniority_by_titlena, ...
## Number of terms at each degree of interaction: 1 43 25
## GRSq 0.8641676  RSq 0.9762886  mean.oof.RSq 0.663021 (sd 0.105)
## 
## pmethod="backward" would have selected:
##     66 terms 63 preds,  GRSq 0.866188  RSq 0.973681  mean.oof.RSq 0.5945903

vip(
  cv_interactions_model_2,
  num_features= 20
)

The 2nd degree interactions analyzed by the above model yield the most impressive results - an R-Squared value of 0.978 on Cross-Validated data, this model also has the lowest RMSE of all models created for this analysis. The second degree interactions created by this model are fascinating -

24 2nd degree interactions were included.
Rating generally only becomes important when considered in the context of industry, location, seniority, and skillset
Revenue of company only becomes important when considered in the context of company rating

summary(cv_interactions_model_3)

## Call: earth(formula=data_scientist$scaled_salary~., data=data_scientist,
##             pmethod="cv", degree=3, nfold=5)
## 
##                                                                                              coefficients
## (Intercept)                                                                                     97.846470
## Job TitleAssociate, Data Science, Internal Audit                                               -47.050925
## Job TitleData Science Manager                                                                   43.904809
## Job TitleData Scientist - Systems Engineering                                                  -34.879673
## Job TitleData Scientist Manager                                                                 22.294479
## Job TitleDirector Data Science                                                                  61.719973
## Job TitleDirector II, Data Science - GRS Predictive Analytics                                  -59.440559
## Job TitleDirector, Data Science                                                                107.207996
## Job TitleProduct Engineer â\200“ Spatial Data Science and Statistical Analysis                    -36.391809
## Job TitleSenior Data Scientist / Machine Learning                                              -27.596832
## Job TitleStaff Data Scientist - Technology                                                      57.245564
## LocationBellevue, WA                                                                            72.076617
## LocationChicago, IL                                                                             14.879843
## LocationPort Washington, NY                                                                    -37.225207
## LocationSan Ramon, CA                                                                          -21.239990
## LocationSanta Barbara, CA                                                                      -26.732179
## LocationSanta Clara, CA                                                                         16.190296
## LocationSouth San Francisco, CA                                                                 23.281748
## IndustryConsumer Product Rental                                                                -34.199607
## IndustryEnergy                                                                                 -26.792715
## IndustryHealth Care Services & Hospitals                                                       -12.499328
## IndustryInvestment Banking & Asset Management                                                   20.544559
## IndustryK-12 Education                                                                          21.789731
## IndustryLending                                                                                 26.417413
## IndustryStaffing & Outsourcing                                                                  38.329883
## IndustryTravel Agencies                                                                        -24.970301
## CompetitorsAdvisory Board, Booz Allen Hamilton, McKinsey & Company                              77.344046
## CompetitorsHarvard Business School, Coursera, edX                                              -36.755063
## CompetitorsInfosys, EPAM, Accenture                                                            -26.545876
## CompetitorsLeidos, CACI International, Booz Allen Hamilton                                      29.606963
## company_txtEntefy                                                                              -30.318917
## company_txth2o.ai                                                                               26.344650
## StateMD                                                                                         18.252473
## StateNM                                                                                        -17.710541
## seniority_by_titlena                                                                           -24.052973
## SkillsetPython,aws,sql                                                                         -22.410268
## SkillsetPython,excel,sql                                                                        21.925213
## SkillsetPython,sas                                                                             141.708570
## h(Age-18)                                                                                        6.764016
## h(21-Age)                                                                                        2.093253
## h(Age-21)                                                                                       -6.729104
## LocationBoston, MA * seniority_by_titlena                                                      -51.511106
## LocationSaint Louis, MO * seniority_by_titlena                                                  65.729242
## LocationSan Francisco, CA * sql1                                                               -22.418244
## LocationSan Francisco, CA * tensor1                                                            -13.352711
## Size18264 * seniority_by_titlena                                                               -13.111209
## IndustryAerospace & Defense * seniority_by_titlena                                               7.709355
## IndustryConsulting * seniority_by_titlena                                                      -11.744153
## CompetitorsLeidos, CACI International, Booz Allen Hamilton * seniority_by_titlena               25.642272
## company_txtPfizer * seniority_by_titlena                                                       -22.912605
## seniority_by_titlena * Skillsetexcel                                                           -42.074384
## Job TitleData Scientist - Sales * h(Rating-4.3)                                                122.126976
## Job TitleSenior Data Scientist - R&D Oncology * h(4.3-Rating)                                  -35.361628
## Job TitleSenior Data Scientist Artificial Intelligence * h(Age-21)                              -0.495732
## Job TitleSr Expert Data Science, Advanced Visual Analytics (Associate level) * h(4.3-Rating)   -50.611333
## h(4.3-Rating) * LocationEmeryville, CA                                                          26.143606
## h(4.3-Rating) * LocationRaleigh, NC                                                             38.944123
## h(Rating-4.3) * Size501 - 1000                                                                 -30.004219
## h(4.3-Rating) * Revenue$10 to $25 million (USD)                                                -19.531618
## h(4.3-Rating) * company_txtNorthrop Grumman                                                    -36.737039
## h(Rating-4.3) * StateNY                                                                         50.245159
## h(4.3-Rating) * spark1                                                                          -5.376771
## Size501 - 1000 * h(Age-21)                                                                      -0.252266
## IndustryIT Services * h(Age-21)                                                                 -0.459273
## h(11-Age) * seniority_by_titlena                                                                -1.586548
## LocationBoston, MA * Type of ownershipCompany - Private * seniority_by_titlena                  46.583331
## Size201 - 500 * sas1 * seniority_by_titlena                                                    -17.803902
## 
## Selected 67 of 78 terms, and 64 of 623 predictors (pmethod="cv")
## Termination condition: RSq changed by less than 0.001 at 78 terms
## Importance: SkillsetPython,sas, seniority_by_titlena, ...
## Number of terms at each degree of interaction: 1 40 24 2
## GRSq 0.85922  RSq 0.9733693  mean.oof.RSq 0.6369842 (sd 0.107)
## 
## pmethod="backward" would have selected:
##     70 terms 67 preds,  GRSq 0.863459  RSq 0.9771317  mean.oof.RSq NA
##     (mean.oof.RSq is NA because most fold models have less than 70 terms)

vip(
  cv_interactions_model_3,
  num_features= 20
)

summary(cv_interactions_model_4)

## Call: earth(formula=data_scientist$scaled_salary~., data=data_scientist,
##             pmethod="cv", degree=4, nfold=5)
## 
##                                                                             coefficients
## (Intercept)                                                                    117.61386
## Job TitleAssociate, Data Science, Internal Audit                               -46.08742
## Job TitleData Science Manager                                                   48.78422
## Job TitleDirector Data Science                                                  62.96789
## Job TitleDirector II, Data Science - GRS Predictive Analytics                  -59.44056
## Job TitleDirector, Data Science                                                110.62049
## Job TitleProduct Engineer â\200“ Spatial Data Science and Statistical Analysis    -38.74706
## Job TitleStaff Data Scientist - Technology                                      59.43612
## LocationBellevue, WA                                                            54.77813
## LocationChicago, IL                                                             13.54958
## IndustryEnergy                                                                 -27.98491
## IndustryStaffing & Outsourcing                                                  33.19838
## IndustryTravel Agencies                                                        -31.32819
## CompetitorsAdvisory Board, Booz Allen Hamilton, McKinsey & Company              80.13801
## CompetitorsLeidos, CACI International, Booz Allen Hamilton                      46.36281
## company_txth2o.ai                                                               28.68829
## seniority_by_titlena                                                           -24.39220
## SkillsetPython,sas                                                             146.97502
## LocationSaint Louis, MO * seniority_by_titlena                                  70.27384
## LocationSan Francisco, CA * sql1                                               -28.74327
## IndustryConsulting * seniority_by_titlena                                      -16.94693
## seniority_by_titlena * Skillsetexcel                                           -40.90739
## 
## Selected 22 of 78 terms, and 22 of 623 predictors (pmethod="cv")
## Termination condition: RSq changed by less than 0.001 at 78 terms
## Importance: SkillsetPython,sas, seniority_by_titlena, ...
## Number of terms at each degree of interaction: 1 17 4
## GRSq 0.7294638  RSq 0.8180003  mean.oof.RSq 0.4202951 (sd 0.211)
## 
## pmethod="backward" would have selected:
##     70 terms 67 preds,  GRSq 0.863459  RSq 0.9771317  mean.oof.RSq NA
##     (mean.oof.RSq is NA because most fold models have less than 70 terms)

We can see an increase in the accuracy of our scaled salary predictions as we increase the degree of interactions within the model through the decreased RMSE, with our optimal RMSE being achieved at a model measuring 2nd degree interactions.

sqrt(mean(cv_interactions_model_2$residuals^2))

## [1] 4.921699

sqrt(mean(cv_interactions_model_3$residuals^2))

## [1] 5.215881

sqrt(mean(cv_interactions_model_4$residuals^2))

## [1] 13.63552

The RMSE of the 2nd degree model is the lowest of any model in the data, which, in addition to it’s high R-Squared value on Cross-Validated data, indicates that it is highly accurate on unseen data and reliably predicts within 5k of the expected salary of a given position after scaling for location.

Conclusions

General Findings

Among Random Forest, Decision Tree, and MARS Models of the 1st - 4th degree, 2nd Degree MARS modeling is the most accurate on this dataset - as well as the most informative. The Location a job is in, what company you work at, what industry you’re in, and what skills you bring are universally considered important by these modeling techniques to determining salary for a role as a data scientist, with the most important single skill being Python.

SAS was also included as being important in this model - however, I believe that this finding can be tempered by additional context. SAS has largely fallen out of fashion, meaning that those still using it will likely have been practicing professionals for a long time - a high proficiency in SAS can be considered a heuristic for experience, and director/senior level roles (also accompanying experience) similarly correspond to higher salary values.

As a new data scientist - the most profitable skills for you to learn are Python, SQL, AWS, & Tableau, but more importantly, you need to find a city and a company that values Data Scientists. That is an essential, and overlooked, component of the salary of a data scientist as indicated by the modeling work I’ve done here.

The highest performing model in this analysis was a MARS Model analyzing 2nd-degree interactions between variables, achieving an R-Squared value of 0.97 on Cross Validated Data and a RMSE of 4.76, which constitutes a 5% difference in the average salary of the untransformed dataset.

Unexpected Findings

Python & SAS as a skillset increase average scaled salary by 149k – all else held equal
Competitors of Harvard Business School, Coursera, edX reduce average scaled salary by 34k – all else held equal
A job requiring only Python in San Francisco reduces average salary by 25k – all else held equal
Company ratings below 4.3 that require Excel for a job reduce average salary by 63k, while Python, Excel, SQL, and SAS together for similar company ratings increase average salary by 13k

Salary Modeling with MARS Using Cross Validation

John Trygier

2022-02-01