Final Project: Data Science Culmination Project

Author

Josh Miller & Elijah Hill

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
styled <-
  theme_bw() + 
  theme(
    plot.title = element_text(face = "bold", size = 12),
    legend.background = element_rect(
      fill = "white", 
      linewidth = 4, 
      colour = "white"
    ),
    axis.ticks = element_line(colour = "grey70", linewidth = 0.2),
    panel.grid.major = element_line(colour = "grey70", linewidth = 0.2),
    panel.grid.minor = element_blank()
  )
library("tidymodels")
theme_set(styled)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library("janitor")

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library("olsrr")

Attaching package: 'olsrr'
The following object is masked from 'package:datasets':

    rivers
library("doParallel")
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators
Loading required package: parallel
library("dplyr")
library("kernlab")

Attaching package: 'kernlab'
The following object is masked from 'package:scales':

    alpha
The following object is masked from 'package:purrr':

    cross
The following object is masked from 'package:ggplot2':

    alpha
library("rpart.plot")
Loading required package: rpart

Attaching package: 'rpart'
The following object is masked from 'package:dials':

    prune
library("glmnet")
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-8
library("GGally")
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library("cowplot")

Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':

    stamp
library("jtools")

Attaching package: 'jtools'
The following object is masked from 'package:yardstick':

    get_weights
library("caret")
Loading required package: lattice

Attaching package: 'caret'
The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity
The following object is masked from 'package:purrr':

    lift
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

Introduction:

For our final project, we used the Salary by Job Title and Country dataset, which we found on Kaggle.com.

https://www.kaggle.com/datasets/amirmahdiabbootalebi/salary-by-job-title-and-country/data

The dataset creator sourced this data from reputable employment websites and surveys, leaving out names and companies to ensure privacy for both parties.

Salary <- read_csv("Salary.csv")
Rows: 6684 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Gender, Job Title, Country, Race
dbl (5): Age, Education Level, Years of Experience, Salary, Senior

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

There are 9 variables in the data, with 6684 observations. The variables are Age, Gender, Education Level, Job Title, Years of Experience, Salary, Country, Race, and Senior. Education Level is encoded from 0 to 3: 0 means the employee’s highest level of education is a high school diploma, 1 a Bachelor’s degree, 2 a Master’s degree, and 3 a Doctorate. Senior is a binary variable indicating whether or not the employee holds a senior-level position. Salary has been converted into USD for all countries so that every value is on the same scale.
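The Education Level encoding described above can be made explicit with a labeled factor. The following is a minimal sketch on a few made-up codes; `education_labels` and `toy_levels` are illustrative names and are not part of our analysis.

```r
# Hypothetical helper: map the 0-3 codes described above to readable labels.
education_labels <- c("High School", "Bachelor's", "Master's", "Doctorate")

# Toy codes (made up, not taken from Salary.csv).
toy_levels <- c(1, 2, 3, 0, 1)

# An ordered factor keeps the natural ordering of the degrees.
toy_education <- factor(
  toy_levels,
  levels  = 0:3,
  labels  = education_labels,
  ordered = TRUE
)

print(toy_education)
```

Keeping the variable numeric, as we do in the rest of the report, treats each step up in education as the same size; a labeled factor is mainly useful for plot axes and tables.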

head(Salary)
# A tibble: 6 × 9
    Age Gender `Education Level` `Job Title`       `Years of Experience` Salary
  <dbl> <chr>              <dbl> <chr>                             <dbl>  <dbl>
1    32 Male                   1 Software Engineer                     5  90000
2    28 Female                 2 Data Analyst                          3  65000
3    45 Male                   3 Manager                              15 150000
4    36 Female                 1 Sales Associate                       7  60000
5    52 Male                   2 Director                             20 200000
6    29 Male                   1 Marketing Analyst                     2  55000
# ℹ 3 more variables: Country <chr>, Race <chr>, Senior <dbl>

Questions and Goals:

Our main question was: “Can we accurately predict the salary of a job given the predictors in this data set?” Those predictors are Age, Senior, Country, Race, Job Title, Gender, Education Level, and Years of Experience. We also wanted to explore the role each predictor plays in determining Salary. Some secondary questions we asked during our EDA were: “Is one gender more often lower-paid than another?”, “Does an increase in age usually lead to an increase in salary?”, “How big a difference does a job being a senior position make on average to Salary?”, and, to go along with that, “Are older people more likely to be the ones occupying senior positions?”. We also asked, and found answers to, whether the Education Level or Country of a job seems to give access to a higher salary.

Preprocessing:

For preprocessing, we quickly found two issues. First, certain job titles appear only once in the entire data set, one of the most notable being CEO. Although that entry had one of the largest salaries in the dataset, a title with a single observation would not only skew our EDA but also cause problems for our training and testing splits later on, since a title that lands only in the testing set is a level the model never saw during training. Therefore, we decided to drop these titles.

Second, we found values that were likely misreported. Looking at the lowest annual salaries in the dataset, we found several employees with three-figure salaries in jobs that in every other case paid well above that, such as Software Engineer Manager. We are assuming that these were full-time positions paid a yearly salary, which may be a large assumption, but even if these values were correctly recorded, they would still be inconsistent with the rest of the dataset and would skew the lowest-paying jobs.

##PREPROCESSING

#removing any job only included once 
Salary_cleaned <- Salary %>%
  group_by(`Job Title`) %>%
  mutate(
    count = n()
  ) %>%
  filter(count > 1) %>%
  ungroup() %>%
  select(-count)
#dropping probable mistaken entries (reported less than 1k salaries)
Salary_cleaned <- Salary_cleaned %>%
  arrange(Salary) %>%
  filter(!row_number() %in% c(1,2,3,4))
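Dropping the first four rows after sorting works for this particular file, but it depends on exactly four entries being affected. A value-based filter is less fragile. The sketch below uses made-up data, and the `min_salary` cutoff of 1000 is an assumption taken from the "less than 1k" comment above; it is not the code that produced the results in this report.

```r
library(dplyr)

# Toy data (made up): two titles appear once, one salary is implausibly low.
toy_salary <- tibble(
  `Job Title` = c("Engineer", "Engineer", "Engineer",
                  "Analyst", "Analyst", "CEO", "Intern"),
  Salary      = c(90000, 550, 110000, 65000, 70000, 250000, 30000)
)

min_salary <- 1000  # assumed cutoff for "probable mistaken entries"

toy_cleaned <- toy_salary %>%
  add_count(`Job Title`, name = "count") %>%  # rows per job title
  filter(count > 1) %>%                       # drop titles that appear once
  select(-count) %>%
  filter(Salary >= min_salary)                # drop low salaries by value, not position

print(toy_cleaned)
```

With a threshold, the same code keeps working if the source file is updated and the number of suspicious rows changes.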

EDA:

Exploring Gender:

We wanted to explore whether women still earn less than men on average, as they have historically, so we first compared the overall average salary of men with that of women.

#EDA (Gender)

##GENDER DIFFERENCES
Salary_cleaned_by_gender <- Salary_cleaned %>%
  group_by(Gender) %>%
    summarize(Mean = mean(Salary, na.rm = TRUE))
#On average, women earn less than men

Salary_cleaned_by_gender
# A tibble: 2 × 2
  Gender    Mean
  <chr>    <dbl>
1 Female 107981.
2 Male   121503.

This table shows a sizable difference (about $13,500) between the average male salary and the average female salary, supporting our initial theory. We then split up the data to delve more deeply into the differences in pay between the two genders.
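The size of the gap can be checked directly from the two means printed in the table. This short sketch only redoes that arithmetic; the variable names are illustrative, and the inputs are the rounded values shown above.

```r
# Mean salaries copied from the table above (rounded to the nearest dollar).
mean_female <- 107981
mean_male   <- 121503

gap_usd <- mean_male - mean_female    # absolute gap in USD
gap_pct <- 100 * gap_usd / mean_male  # gap as a share of the male mean

cat("Gap:", gap_usd, "USD, or", round(gap_pct, 1), "percent of the male mean\n")
```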

#Splitting salary into male and female
salary_male <- Salary_cleaned %>%
  filter(Gender == "Male")

salary_female <- Salary_cleaned %>%
  filter(Gender == "Female")

After splitting the data, we tried making four plots showing the top 15 highest-salary jobs and the bottom 15 lowest-salary jobs for comparison.

# #comparing highest/lowest earning male/female jobs
#  top_male_salaries <- salary_male %>%
#  arrange(desc(Salary)) %>%
#    slice(1:15)
#  
#  #ignoring mistaken entries (1 and 2 row)
#  bottom_male_salaries <- salary_male %>%
#  arrange(Salary) %>%
#    slice(3:17)
#  
#  top_female_salaries <- salary_female %>%
#  arrange(desc(Salary)) %>%
#  slice(1:15)
#  
# #ignoring mistaken entries (1 and 2 row)
#  bottom_female_salaries <- salary_female %>%
#  arrange(Salary) %>%
#    slice(3:17)

#plots
#tms_plot <- ggplot(top_male_salaries, aes(x = Salary)) +
# geom_bar(fill = "blue") +
# theme_light()

#bms_plot <- ggplot(bottom_male_salaries, aes(x = Salary)) +
# geom_bar(fill = "turquoise2") +
# theme_dark()

#the above plots don't look good...
#they are mostly the same jobs

#trying again but with...

Upon making the first few plots, we realized that they did not look good, as they mostly showed the same job titles’ salaries repeated multiple times. We remade the graphs, this time averaging the salaries for each job title so that each title appears only once. First, we built new tables with an Average_Salary column for each job title, then kept only the distinct Job Title and Average_Salary pairs, since those were what we were focusing on.

#averaging jobs with the same title together
salary_male_unique <- salary_male %>%
  group_by(`Job Title`) %>%
  mutate(Average_Salary = mean(Salary)) %>%
  distinct(Average_Salary)
  
salary_female_unique <- salary_female %>%
  group_by(`Job Title`) %>%
  mutate(Average_Salary = mean(Salary)) %>%
  distinct(Average_Salary)

salary_male_unique
# A tibble: 61 × 2
# Groups:   Job Title [61]
   `Job Title`                    Average_Salary
   <chr>                                   <dbl>
 1 Sales Associate                        33515.
 2 Delivery Driver                        28000 
 3 Sales Representative                   46444.
 4 Digital Marketing Manager              75968.
 5 HR Generalist                          72776.
 6 HR Coordinator                         34667.
 7 Accountant                             53750 
 8 Software Developer                     68011.
 9 Business Development Associate         38333.
10 Operations Analyst                     69167.
# ℹ 51 more rows
salary_female_unique
# A tibble: 67 × 2
# Groups:   Job Title [67]
   `Job Title`                     Average_Salary
   <chr>                                    <dbl>
 1 Sales Associate                         28207.
 2 Sales Representative                    35833.
 3 Receptionist                            25000 
 4 HR Coordinator                          41062.
 5 Customer Service Representative         33333.
 6 HR Generalist                           48855.
 7 Juniour HR Coordinator                  32000 
 8 Marketing Analyst                       63083.
 9 Business Development Associate          42500 
10 Operations Manager                      95200 
# ℹ 57 more rows
#comparing highest/lowest earning male/female jobs
top_male_salaries_unique <- 
  salary_male_unique %>%
  ungroup() %>% arrange(desc(Average_Salary)) %>% slice(1:10)

bottom_male_salaries_unique <- 
  salary_male_unique %>%
  ungroup() %>% 
  arrange(Average_Salary) %>% 
  slice(1:10)

top_female_salaries_unique <- 
  salary_female_unique %>%
  ungroup() %>% 
  arrange(desc(Average_Salary)) %>% slice(1:10)

bottom_female_salaries_unique <- 
  salary_female_unique %>%
  ungroup() %>%
  arrange(Average_Salary) %>%
  slice(1:10)

top_male_salaries_unique 
# A tibble: 10 × 2
   `Job Title`               Average_Salary
   <chr>                              <dbl>
 1 Director of Data Science         207742.
 2 Marketing Director               189900 
 3 Director of Engineering          180000 
 4 Software Engineer Manager        173385.
 5 Project Engineer                 173344.
 6 Director of Operations           171667.
 7 Director of Finance              170000 
 8 Research Director                165870.
 9 Data Scientist                   165062.
10 Director of Marketing            160641.
bottom_male_salaries_unique
# A tibble: 10 × 2
   `Job Title`                    Average_Salary
   <chr>                                   <dbl>
 1 Delivery Driver                        28000 
 2 Sales Associate                        33515.
 3 HR Coordinator                         34667.
 4 Business Operations Analyst            35000 
 5 Business Development Associate         38333.
 6 Juniour HR Generalist                  43000 
 7 Sales Representative                   46444.
 8 Sales Executive                        47083.
 9 Graphic Designer                       51667.
10 Accountant                             53750 
top_female_salaries_unique
# A tibble: 10 × 2
   `Job Title`                 Average_Salary
   <chr>                                <dbl>
 1 Director of Data Science           200769.
 2 Director of Human Resources        187500 
 3 Director of Finance                180000 
 4 Director of Operations             174000 
 5 Product Manager                    172476.
 6 Software Engineer Manager          171793.
 7 Data Scientist                     162667.
 8 Marketing Director                 162667.
 9 Data Engineer                      160000 
10 Research Director                  159310.
bottom_female_salaries_unique
# A tibble: 10 × 2
   `Job Title`                     Average_Salary
   <chr>                                    <dbl>
 1 Receptionist                            25000 
 2 Sales Associate                         28207.
 3 Juniour HR Coordinator                  32000 
 4 Customer Service Representative         33333.
 5 Sales Representative                    35833.
 6 HR Coordinator                          41062.
 7 Sales Executive                         41154.
 8 Business Development Associate          42500 
 9 Copywriter                              42500 
10 Juniour HR Generalist                   43000 

We made the plots again, standardizing the x-axis limits within each pair of plots so that differences in pay are easier to see. Male plots are blue and female plots are red; top-salary plots use the light default theme, and bottom-salary plots use the dark theme, to help show the comparisons we were looking for.

tms_plot <- ggplot(top_male_salaries_unique, aes(x = Average_Salary)) +
    geom_histogram(fill = "blue") + 
    labs(y = "Job Count", x = "Average Salary (Male)") +
    xlim(150000, 250000)
bms_plot <- ggplot(bottom_male_salaries_unique, aes(x = Average_Salary)) +
    geom_histogram(fill = "turquoise2", bins = 40) +
    labs(y = "Job Count", x = "Average Salary (Male)") +
    xlim(24000, 58000)+ ylim(0, 3) + theme_dark()
tfs_plot <- ggplot(top_female_salaries_unique, aes(x = Average_Salary)) +
    geom_histogram(fill = "red") +
    labs(y = "Job Count", x = "Average Salary (Female)") +
    xlim(150000, 250000)
bfs_plot <- ggplot(bottom_female_salaries_unique, aes(x = Average_Salary)) +
    geom_histogram(fill = "lightcoral") +
    labs(y = "Job Count", x = "Average Salary (Female)") +
    xlim(24000, 58000) + ylim(0, 3) + theme_dark()

We used the “cowplot” package to easily combine all four plots into one graphic for a more complete visual comparison of gender salary differences at the extremes of the data.

plot_grid(tms_plot, bms_plot, tfs_plot, bfs_plot, nrow = 2, ncol = 2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
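The messages above come from drawing histograms of only ten values: `stat_bin()` falls back to 30 bins, and the "Removed ... rows" warnings are most likely bars at the edge of the `xlim()` range being dropped. One alternative, sketched here under the assumption that one bar per job title is what we want to see, is a horizontal bar chart with `geom_col()`. The data frame is copied by hand from the `top_male_salaries_unique` table above; `top_male` and `tms_bar` are illustrative names.

```r
library(ggplot2)

# Values copied from the top_male_salaries_unique table above.
top_male <- data.frame(
  job_title = c("Director of Data Science", "Marketing Director",
                "Director of Engineering", "Software Engineer Manager",
                "Project Engineer", "Director of Operations",
                "Director of Finance", "Research Director",
                "Data Scientist", "Director of Marketing"),
  average_salary = c(207742, 189900, 180000, 173385, 173344,
                     171667, 170000, 165870, 165062, 160641)
)

# One bar per job title, ordered by average salary; no binning is involved.
tms_bar <- ggplot(top_male,
                  aes(x = average_salary,
                      y = reorder(job_title, average_salary))) +
  geom_col(fill = "blue") +
  labs(x = "Average Salary (Male)", y = "Job Title")
```

Printing `tms_bar` draws the chart, and the object can be passed to `plot_grid()` in the same way as the histograms.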

After doing this, we realized that we could do almost the same thing, but in a broader sense (as well as faster), by just using a box plot.

#comparing all salaries
ggplot(Salary_cleaned, aes(x = Salary, y = Gender)) +
    geom_boxplot(aes(fill = Gender)) +
    #the boxes are filled, so the manual palette must be a fill scale
    scale_fill_manual(values = c(Male = "blue", Female = "red")) +
    theme(legend.position = "none")

The box plots, as well as the four plots prior, all point to what we had guessed: males do indeed earn higher salaries than females on average.

Exploring Education Level:

Next, we examined the role education level played on salary amount. This time, we started with the general plot comparing all salaries grouped by Education Level, then moved on to showing the average salary of each Education Level after that.

#EDA (Education Level)

ggplot(Salary_cleaned, aes(y = Salary,
                           x = as.factor(`Education Level`),
                           fill = as.factor(`Education Level`))) +
    #fill is the mapped aesthetic, so use scale_fill_manual
    scale_fill_manual(values = c("red", "green", "yellow", "darkorchid3")) +
    labs(x = "Education Level") + 
    geom_boxplot() +
    theme(legend.position = "none")

salary_by_ed_lvl <- Salary_cleaned %>%
  group_by(`Education Level`) %>%
  summarize(Mean = mean(Salary))

ggplot(salary_by_ed_lvl, aes(x = `Education Level`,
                             y = Mean,
                             fill = as.factor(`Education Level`))) +
    #fill is the mapped aesthetic, so use scale_fill_manual
    scale_fill_manual(values = c("red", "green", "yellow", "darkorchid3")) + 
    labs(y = "Mean Salary") +
    geom_col() +
    theme(legend.position = "none")

We were pleased to see not only that the data seemed to indicate that going to college is indeed still worth it, but also that the relationship looked roughly linear for both the raw salaries and the average salary by education level.

Exploring Age and Seniority:

Age and Seniority were two predictors we were especially excited to look at, and we expected both to be strongly correlated with Salary. Once again, we started with a general plot of raw salary against Age, this time coloring each point by whether or not the job holder had Senior status. The resulting scatter plot (whose color palette we were very happy with) shows a positive relationship between a person’s age, whether they hold a senior position, and their salary. To get a slightly different perspective, we grouped the ages by decade and compared each Age Group’s average salary, and were once again satisfied to see a seemingly linear relationship between Age and Salary.

#EDA (Age/Seniority)
ggplot(Salary_cleaned, aes(y = Salary, x = Age, color = as.factor(Senior))) +
    scale_color_manual(values = c("ivory4", "goldenrod"),
                       labels = c("Non-Senior Position", 
                                  "Senior")) +
    labs(color = "Seniority") + geom_point()

salary_by_age <- Salary_cleaned %>%
  mutate(Age_Group = case_when(
    Age < 30 ~ "29 & Younger",
    Age < 40 & Age >= 30 ~ "30's",
    Age < 50 & Age >= 40 ~ "40's",
    Age < 60 & Age >= 50 ~ "50's",
    Age >= 60 ~ "60 & Older"
  )) %>% 
  group_by(Age_Group) %>%
  summarize(Average_Salary = mean(Salary))

ggplot(salary_by_age, aes(x = Age_Group,
                          y = Average_Salary,
                          fill = as.factor(Age_Group))) +
    scale_fill_manual(values = c("ivory4","grey",
                                 "lightgoldenrod2",
                                 "goldenrod2",
                                 "goldenrod3")) +
    geom_col() +
    theme(legend.position = "none",
          panel.background = element_rect(fill = "slategray3")) +
    labs(x = "Age Group", y = "Average Salary")
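The decade grouping built with `case_when()` above can also be written with base R's `cut()`, which takes the break points once instead of repeating both bounds in every condition. This is a minimal sketch on made-up ages; `toy_ages` and `age_group` are illustrative names.

```r
# Toy ages (made up) that sit on or near each boundary.
toy_ages <- c(25, 30, 39, 40, 59, 60, 62)

age_group <- cut(
  toy_ages,
  breaks = c(-Inf, 30, 40, 50, 60, Inf),
  labels = c("29 & Younger", "30's", "40's", "50's", "60 & Older"),
  right  = FALSE  # intervals are [lower, upper), so age 30 falls in "30's"
)

print(data.frame(Age = toy_ages, Age_Group = age_group))
```

Because `cut()` returns a factor whose levels follow the break order, the bars in a plot stay in age order without relying on alphabetical sorting of the labels.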

Exploring Country and Race:

The first thing we did to look at Country and Race was to use multilevel grouping to get a better understanding of the demographics of the data. After noting the variety of Races in each Country, we made a violin plot comparing the Salaries of those living in different Countries. That plot did not look great, so we went back to box plots for comparing Salaries across Races. Our main takeaway from these two plots was that the Country and Race of a person do not seem to be significant factors in determining their salary.

#EDA (Country/Race)
Salary_cleaned %>% 
  group_by(Country, Race) %>%
  summarize(count = n())
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
# A tibble: 17 × 3
# Groups:   Country [5]
   Country   Race             count
   <chr>     <chr>            <int>
 1 Australia Asian              470
 2 Australia Australian         449
 3 Australia White              407
 4 Canada    Asian              452
 5 Canada    Black              428
 6 Canada    White              431
 7 China     Chinese            441
 8 China     Korean             454
 9 China     White              438
10 UK        Asian              328
11 UK        Mixed              329
12 UK        Welsh              330
13 UK        White              327
14 USA       African American   349
15 USA       Asian              330
16 USA       Hispanic           318
17 USA       White              346
ggplot(Salary_cleaned, aes(x = Salary, y = Country, fill = as.factor(Country))) + 
    scale_fill_manual(values = c("red", "white", "gold", "purple", "blue")) +
    geom_violin() +
    theme(legend.position = "none")

ggplot(Salary_cleaned, aes(x = Salary, y = Race, color =  as.factor(Race))) +
    geom_boxplot() +
    theme(legend.position = "none")
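The demographic table at the start of this section prints a `summarise()` grouping message because the result is still grouped by Country. A shorter way to get the same kind of counts, sketched here on made-up rows, is `count()`, which returns an ungrouped tibble. `toy_people` and `demographics` are illustrative names.

```r
library(dplyr)

# Toy rows (made up) standing in for Salary_cleaned.
toy_people <- tibble(
  Country = c("USA", "USA", "USA", "UK", "UK"),
  Race    = c("White", "White", "Asian", "Welsh", "White")
)

# count() is shorthand for group_by() followed by summarize(n()).
demographics <- toy_people %>%
  count(Country, Race, name = "count")

print(demographics)
```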

Exploring Job Title:

When thinking of what to explore with Job Titles, we were at first a little unsure of what to compare, since there were so many unique Job Titles in the data. We ended up simply making a table of the top 10 highest-salary jobs and the “top 10” lowest-salary jobs.

#EDA (Job Title)
#hrm...
salary_by_job_title <- Salary_cleaned %>%
  group_by(`Job Title`) %>%
  mutate(Average_Salary = mean(Salary)) %>%
  distinct(Average_Salary)

top_jobs <- salary_by_job_title %>%
  ungroup() %>%
  arrange(desc(Average_Salary)) %>%
  slice(1:10)

worst_jobs <- salary_by_job_title %>%
  ungroup() %>%
  arrange(Average_Salary) %>%
  slice(1:10)

top_jobs
# A tibble: 10 × 2
   `Job Title`                 Average_Salary
   <chr>                                <dbl>
 1 Director of Data Science           204561.
 2 Director of Human Resources        187500 
 3 Marketing Director                 183615.
 4 Director of Engineering            180000 
 5 Director of Finance                175000 
 6 Software Engineer Manager          172961.
 7 Director of Operations             172727.
 8 Project Engineer                   166064.
 9 Data Scientist                     164099.
10 Research Director                  163333.
worst_jobs
# A tibble: 10 × 2
   `Job Title`                     Average_Salary
   <chr>                                    <dbl>
 1 Receptionist                            25000 
 2 Delivery Driver                         28000 
 3 Sales Associate                         30736.
 4 Juniour HR Coordinator                  32000 
 5 Customer Service Representative         33333.
 6 Business Operations Analyst             35000 
 7 HR Coordinator                          38321.
 8 Business Development Associate          40714.
 9 Sales Representative                    41728.
10 Copywriter                              42500 

One interesting thing that we could see from these tables is that Job Titles with “Director” and “Engineer” are featured frequently in the higher end of the Salary data. This could either be an insight into the types of jobs that give high Salaries, or the types of jobs that the data was scraped from. Either way, the wide range of names meant that job titles were most likely going to be largely ineffective as a predictor for our models.
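The per-title averages and the top and bottom tables can also be produced with `summarize()` and the `slice_max()` / `slice_min()` helpers from dplyr. The sketch below is an alternative on made-up data, not the code that generated the tables above; `toy_jobs`, `avg_by_title`, `top_toy`, and `worst_toy` are illustrative names.

```r
library(dplyr)

# Toy data (made up): three titles, two salaries each.
toy_jobs <- tibble(
  `Job Title` = c("Director", "Director", "Engineer",
                  "Engineer", "Receptionist", "Receptionist"),
  Salary      = c(180000, 200000, 120000, 100000, 24000, 26000)
)

# One row per job title with its mean salary, returned ungrouped.
avg_by_title <- toy_jobs %>%
  group_by(`Job Title`) %>%
  summarize(Average_Salary = mean(Salary), .groups = "drop")

# Highest and lowest average salaries (n = 2 here; n = 10 in the report).
top_toy   <- avg_by_title %>% slice_max(Average_Salary, n = 2)
worst_toy <- avg_by_title %>% slice_min(Average_Salary, n = 2)

print(top_toy)
print(worst_toy)
```

Note that `slice_max()` and `slice_min()` keep ties by default, so they can return more than `n` rows when several titles share the same average.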

Modeling:

Now that we have gathered some insights about this data as well as having answered our minor questions from our exploratory analysis, we will use modeling to answer our main question.

Before we get into creating the models, we will split the salary dataset into a training and testing data frame, using a 90/10 proportion respectively.

salary_split <- initial_split(Salary_cleaned, prop = 0.90)
training <- training(salary_split)
testing <- testing(salary_split)
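Because `initial_split()` samples rows at random, the split above changes every time the document is rendered. Setting a seed first makes it reproducible. The sketch below demonstrates this on made-up data; the seed value 2024 is arbitrary, and `toy_data`, `split_a`, and `split_b` are illustrative names. The results in this report were produced without a seed.

```r
library(rsample)

# Toy data (made up): 200 rows with evenly spaced salaries.
toy_data <- data.frame(
  id     = 1:200,
  Salary = seq(30000, 229000, by = 1000)
)

set.seed(2024)  # arbitrary seed, chosen only for illustration
split_a <- initial_split(toy_data, prop = 0.90)

set.seed(2024)  # same seed again, so the same rows are drawn
split_b <- initial_split(toy_data, prop = 0.90)

train_a <- training(split_a)
test_a  <- testing(split_a)

# TRUE when both splits put the same rows in the training set
print(identical(train_a$id, training(split_b)$id))
```

`initial_split()` also accepts a `strata` argument, which could be used to keep the Salary distribution similar in both parts.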

To create a ridge regression model, we need to convert all of our categorical variables into numeric codes. We will get to the ridge regression model later; this code is placed here for rendering reasons.

ridge_salary <- Salary_cleaned %>%
  transform(.,
                   Race = as.numeric(as.factor(Race)),
                   Country = as.numeric(as.factor(Country)),
                   `Job Title` = as.numeric(as.factor(`Job Title`)),
                   Gender = as.numeric(as.factor(Gender)))
ridge_split <- initial_split(ridge_salary, prop = .90)
train <- training(ridge_split)
test <- testing(ridge_split)
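Converting a factor to integer codes, as in the `transform()` call above, assigns each category a number in alphabetical order, so a model treats the categories as if they were evenly spaced on a scale. A common alternative is indicator (dummy) columns, which is also how `lm()` handles character predictors. The sketch below contrasts the two encodings on made-up rows using base R's `model.matrix()`; it is an illustration of the difference, not the encoding used in this report, and `toy`, `country_code`, and `country_dummies` are illustrative names.

```r
# Toy rows (made up).
toy <- data.frame(
  Country = c("USA", "UK", "China", "UK"),
  Salary  = c(90000, 65000, 70000, 80000)
)

# Integer codes: levels are sorted alphabetically (China = 1, UK = 2, USA = 3).
country_code <- as.numeric(as.factor(toy$Country))

# Indicator columns: one 0/1 column per country, with no implied ordering.
country_dummies <- model.matrix(~ Country - 1, data = toy)

print(country_code)
print(country_dummies)
```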

Multiple Linear Regression:

Now that we split the data into training and testing, we will create our first model: a multiple linear regression model. A multiple linear regression model is simple, yet it can still give a good benchmark for comparisons to our other models.

fit <- lm(Salary ~ ., data = training)
summary(fit)

Call:
lm(formula = Salary ~ ., data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-128640  -11531      72   11323   64802 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                 27961.02   11753.96   2.379
Age                                            52.69     128.84   0.409
GenderMale                                    125.09     647.39   0.193
`Education Level`                            6759.87     602.31  11.223
`Job Title`Accountant                       -1448.32   14449.62  -0.100
`Job Title`Administrative Assistant        -36959.78   19393.59  -1.906
`Job Title`Back end Developer               29012.78   11303.87   2.567
`Job Title`Business Analyst                  9278.53   12440.57   0.746
`Job Title`Business Development Associate   -8228.47   15023.22  -0.548
`Job Title`Business Development Manager     26294.31   15839.02   1.660
`Job Title`Business Operations Analyst     -11943.50   25029.12  -0.477
`Job Title`Content Marketing Manager        21068.34   11560.36   1.822
`Job Title`Copywriter                      -10176.52   19394.53  -0.525
`Job Title`Customer Service Manager        -23925.86   19405.42  -1.233
`Job Title`Customer Service Representative -10361.47   14461.58  -0.716
`Job Title`Data Analyst                     54128.02   11264.03   4.805
`Job Title`Data Engineer                    27455.85   15864.04   1.731
`Job Title`Data Scientist                   54416.86   11289.41   4.820
`Job Title`Delivery Driver                  -4148.77   15832.76  -0.262
`Job Title`Digital Marketing Manager        13344.43   11673.16   1.143
`Job Title`Digital Marketing Specialist      5186.10   12800.93   0.405
`Job Title`Director of Data Science         60397.70   11706.62   5.159
`Job Title`Director of Engineering          25450.91   19430.92   1.310
`Job Title`Director of Finance              20684.08   19417.56   1.065
`Job Title`Director of HR                    9145.11   11582.89   0.790
`Job Title`Director of Human Resources      20276.92   19440.59   1.043
`Job Title`Director of Marketing            25010.38   11519.39   2.171
`Job Title`Director of Operations           12660.84   13295.45   0.952
`Job Title`Engineer                          8080.92   19437.87   0.416
`Job Title`Event Coordinator               -20695.61   25007.95  -0.828
`Job Title`Financial Advisor                16330.15   15024.22   1.087
`Job Title`Financial Analyst                18973.42   11663.88   1.627
`Job Title`Financial Manager                44673.16   11375.96   3.927
`Job Title`Front end Developer              22786.79   11299.90   2.017
`Job Title`Front End Developer              18680.92   11924.37   1.567
`Job Title`Full Stack Engineer              34864.63   11286.77   3.089
`Job Title`Graphic Designer                   826.81   12320.43   0.067
`Job Title`HR Coordinator                   -9100.63   12089.48  -0.753
`Job Title`HR Generalist                      -23.01   11426.79  -0.002
`Job Title`HR Manager                        8075.77   15841.77   0.510
`Job Title`Human Resources Coordinator     -10140.39   11668.72  -0.869
`Job Title`Human Resources Manager          15254.65   11354.89   1.343
`Job Title`IT Consultant                    25852.45   25046.94   1.032
`Job Title`IT Support Specialist            -4107.66   19394.57  -0.212
`Job Title`Juniour HR Coordinator           -5902.54   19393.01  -0.304
`Job Title`Juniour HR Generalist             1036.75   17109.53   0.061
`Job Title`Manager                          24193.95   19427.65   1.245
`Job Title`Marketing Analyst                 7623.76   11366.30   0.671
`Job Title`Marketing Coordinator             8905.06   11348.74   0.785
`Job Title`Marketing Director               64593.51   11582.31   5.577
`Job Title`Marketing Manager                17552.02   11287.43   1.555
`Job Title`Marketing Specialist              5269.26   13239.95   0.398
`Job Title`Operations Analyst               -4211.39   13703.76  -0.307
`Job Title`Operations Coordinator           19127.15   17093.34   1.119
`Job Title`Operations Manager               14127.03   11403.03   1.239
`Job Title`Product Designer                  6068.38   11503.84   0.528
`Job Title`Product Manager                  56716.41   11279.24   5.028
`Job Title`Product Marketing Manager        27728.19   11576.72   2.395
`Job Title`Project Coordinator               6785.32   15041.38   0.451
`Job Title`Project Engineer                 52300.18   11312.23   4.623
`Job Title`Project Manager                  25440.83   11846.63   2.148
`Job Title`Receptionist                     -6458.27   11629.58  -0.555
`Job Title`Recruiter                       -22397.21   17096.11  -1.310
`Job Title`Research Director                50900.40   11607.01   4.385
`Job Title`Research Scientist               42189.94   11471.55   3.678
`Job Title`Sales Associate                  -8477.07   11315.14  -0.749
`Job Title`Sales Director                   33900.71   11592.35   2.924
`Job Title`Sales Executive                  -2245.43   11838.37  -0.190
`Job Title`Sales Manager                    21826.89   11601.75   1.881
`Job Title`Sales Representative             -7536.37   11504.21  -0.655
`Job Title`Scientist                        17019.22   19412.17   0.877
`Job Title`Social Media Manager               -34.87   12605.58  -0.003
`Job Title`Social Media Specialist          -9195.83   19388.73  -0.474
`Job Title`Software Developer                3043.92   11329.39   0.269
`Job Title`Software Engineer                45475.88   11223.51   4.052
`Job Title`Software Engineer Manager        36631.60   11316.25   3.237
`Job Title`Training Specialist             -17204.57   19396.36  -0.887
`Job Title`UX Designer                      21038.06   17118.99   1.229
`Job Title`Web Developer                     -293.64   11383.97  -0.026
`Years of Experience`                        5376.44     158.63  33.892
CountryCanada                                 131.23    1131.20   0.116
CountryChina                                  712.88    1459.93   0.488
CountryUK                                      45.73    1226.28   0.037
CountryUSA                                    178.60    1213.63   0.147
RaceAsian                                    2459.05    1619.57   1.518
RaceAustralian                               2688.23    2084.49   1.290
RaceBlack                                    1113.91    2097.20   0.531
RaceChinese                                   319.89    2273.09   0.141
RaceHispanic                                 1575.39    1840.74   0.856
RaceKorean                                   1894.87    2270.29   0.835
RaceMixed                                    2896.50    2238.72   1.294
RaceWelsh                                     553.37    2231.90   0.248
RaceWhite                                    2852.07    1618.34   1.762
Senior                                     -11413.52    1281.70  -8.905
                                           Pr(>|t|)    
(Intercept)                                0.017398 *  
Age                                        0.682578    
GenderMale                                 0.846795    
`Education Level`                           < 2e-16 ***
`Job Title`Accountant                      0.920163    
`Job Title`Administrative Assistant        0.056728 .  
`Job Title`Back end Developer              0.010294 *  
`Job Title`Business Analyst                0.455801    
`Job Title`Business Development Associate  0.583907    
`Job Title`Business Development Manager    0.096948 .  
`Job Title`Business Operations Analyst     0.633249    
`Job Title`Content Marketing Manager       0.068435 .  
`Job Title`Copywriter                      0.599804    
`Job Title`Customer Service Manager        0.217645    
`Job Title`Customer Service Representative 0.473722    
`Job Title`Data Analyst                    1.58e-06 ***
`Job Title`Data Engineer                   0.083558 .  
`Job Title`Data Scientist                  1.47e-06 ***
`Job Title`Delivery Driver                 0.793302    
`Job Title`Digital Marketing Manager       0.253014    
`Job Title`Digital Marketing Specialist    0.685393    
`Job Title`Director of Data Science        2.56e-07 ***
`Job Title`Director of Engineering         0.190310    
`Job Title`Director of Finance             0.286818    
`Job Title`Director of HR                  0.429831    
`Job Title`Director of Human Resources     0.296982    
`Job Title`Director of Marketing           0.029959 *  
`Job Title`Director of Operations          0.341000    
`Job Title`Engineer                        0.677622    
`Job Title`Event Coordinator               0.407953    
`Job Title`Financial Advisor               0.277116    
`Job Title`Financial Analyst               0.103858    
`Job Title`Financial Manager               8.70e-05 ***
`Job Title`Front end Developer             0.043788 *  
`Job Title`Front End Developer             0.117258    
`Job Title`Full Stack Engineer             0.002018 ** 
`Job Title`Graphic Designer                0.946498    
`Job Title`HR Coordinator                  0.451617    
`Job Title`HR Generalist                   0.998393    
`Job Title`HR Manager                      0.610227    
`Job Title`Human Resources Coordinator     0.384870    
`Job Title`Human Resources Manager         0.179181    
`Job Title`IT Consultant                   0.302040    
`Job Title`IT Support Specialist           0.832275    
`Job Title`Juniour HR Coordinator          0.760861    
`Job Title`Juniour HR Generalist           0.951684    
`Job Title`Manager                         0.213058    
`Job Title`Marketing Analyst               0.502417    
`Job Title`Marketing Coordinator           0.432676    
`Job Title`Marketing Director              2.56e-08 ***
`Job Title`Marketing Manager               0.119998    
`Job Title`Marketing Specialist            0.690658    
`Job Title`Operations Analyst              0.758613    
`Job Title`Operations Coordinator          0.263193    
`Job Title`Operations Manager              0.215438    
`Job Title`Product Designer                0.597860    
`Job Title`Product Manager                 5.09e-07 ***
`Job Title`Product Marketing Manager       0.016644 *  
`Job Title`Project Coordinator             0.651926    
`Job Title`Project Engineer                3.86e-06 ***
`Job Title`Project Manager                 0.031793 *  
`Job Title`Receptionist                    0.578689    
`Job Title`Recruiter                       0.190221    
`Job Title`Research Director               1.18e-05 ***
`Job Title`Research Scientist              0.000237 ***
`Job Title`Sales Associate                 0.453779    
`Job Title`Sales Director                  0.003464 ** 
`Job Title`Sales Executive                 0.849571    
`Job Title`Sales Manager                   0.059975 .  
`Job Title`Sales Representative            0.512431    
`Job Title`Scientist                       0.380670    
`Job Title`Social Media Manager            0.997793    
`Job Title`Social Media Specialist         0.635313    
`Job Title`Software Developer              0.788189    
`Job Title`Software Engineer               5.15e-05 ***
`Job Title`Software Engineer Manager       0.001214 ** 
`Job Title`Training Specialist             0.375115    
`Job Title`UX Designer                     0.219147    
`Job Title`Web Developer                   0.979423    
`Years of Experience`                       < 2e-16 ***
CountryCanada                              0.907647    
CountryChina                               0.625356    
CountryUK                                  0.970255    
CountryUSA                                 0.883013    
RaceAsian                                  0.128983    
RaceAustralian                             0.197229    
RaceBlack                                  0.595339    
RaceChinese                                0.888088    
RaceHispanic                               0.392119    
RaceKorean                                 0.403955    
RaceMixed                                  0.195779    
RaceWelsh                                  0.804194    
RaceWhite                                  0.078064 .  
Senior                                      < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22360 on 5870 degrees of freedom
Multiple R-squared:  0.823, Adjusted R-squared:  0.8202 
F-statistic: 293.5 on 93 and 5870 DF,  p-value: < 2.2e-16

As we can see, the most significant predictors are education level, years of experience, and Senior. The first two make sense: the more years of experience you have in a profession and the higher your level of education, the more likely you are to earn more than someone with less experience and a lesser degree. Seniority is also a reasonable predictor, since a senior-level position carries more responsibility, though note that its coefficient is negative (about -11,400) once the other predictors are held constant. Several job titles are also good indicators: Software Engineer, Research Scientist, Research Director, Product Manager, and Data Scientist/Data Analyst all have highly significant coefficients. This may be partly because there are simply more observations of these job titles in the data set, but these are certainly highly paid positions. Now, to look at the overall fit. The model has an R-squared of 0.82, an F-statistic of 293.5 on 93 and 5870 degrees of freedom, and a p-value below 2.2e-16. This is a respectable model: it is not perfect, but the predictors together explain about 82% of the variance in Salary.
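The figures discussed above can also be pulled out of a model summary programmatically rather than read off the printout. The following is a minimal sketch on simulated data, not the project's dataset; `toy`, `toy_fit`, and the column names are illustrative assumptions.

```r
# Simulated stand-in data; `toy` and `toy_fit` are illustrative names,
# not objects from this project.
set.seed(1)
n <- 200
toy <- data.frame(
  years = runif(n, 0, 20),
  educ  = sample(0:3, n, replace = TRUE),
  noise = rnorm(n)                       # unrelated to salary by construction
)
toy$salary <- 30000 + 5000 * toy$years + 7000 * toy$educ +
  rnorm(n, sd = 15000)

toy_fit <- lm(salary ~ years + educ + noise, data = toy)
s <- summary(toy_fit)

# Coefficient table as a data frame, sorted by p-value
coefs <- as.data.frame(s$coefficients)
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")
coefs <- coefs[order(coefs$p_value), ]
print(coefs)

# Overall fit: R-squared, F statistic, and the p-value of the F test
r2     <- s$r.squared
f_stat <- s$fstatistic                   # named vector: value, numdf, dendf
f_p    <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
             lower.tail = FALSE)
cat("R-squared:", round(r2, 3),
    " F:", round(f_stat["value"], 1),
    " p:", format.pval(f_p), "\n")
```

Reading the F-statistic and R-squared from the summary object avoids transcription slips when quoting them in the text.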

Before we go any further, we should check the assumptions of our model to see whether a linear model is even appropriate for this dataset.

par(mfrow = c(2,2))
plot(fit)
Warning: not plotting observations with leverage one:
  1659, 4690
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Checking the usual assumptions of linear regression, we can see that the data appears to fit to an acceptable level. The Residuals vs. Fitted plot is distributed mostly evenly from end to end, and in the Q-Q Residuals plot both tails veer slightly off the reference line, but they at least mirror each other.
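The "leverage one" warning printed with the diagnostic plots means that some rows are reproduced exactly by the model. One common cause, which may or may not be what happened here, is a factor level (such as a job title) that appears only once in the training data, so its dummy variable is fitted to a single row. A minimal sketch on simulated data, with `toy` and `toy_fit` as illustrative names:

```r
# Simulated data; `toy` and `toy_fit` are illustrative names only.
set.seed(2)
toy <- data.frame(
  title = c(rep("Analyst", 20), rep("Engineer", 20), "Pilot"),  # "Pilot" occurs once
  years = runif(41, 0, 20)
)
toy$salary <- 40000 + 5000 * toy$years + rnorm(41, sd = 10000)

toy_fit <- lm(salary ~ title + years, data = toy)

# Leverage (hat values): a value of 1 means the model fits that row exactly
h <- hatvalues(toy_fit)
leverage_one <- which(h > 1 - 1e-8)
print(toy[leverage_one, ])

# Factor levels that appear only once in the data
title_counts <- table(toy$title)
singletons <- names(title_counts)[title_counts == 1]
print(singletons)
```

Rows like this contribute nothing to the residual diagnostics, which is why `plot()` leaves them out.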

Now let us apply this model to our testing data. As you can see, we bind the predicted values onto the testing dataset so we can compare each prediction to the employee’s actual salary. Because the prediction vector is unnamed, bind_cols() labels that column ...1.

lm_preds <- predict(fit, testing) %>%
  bind_cols(testing)
New names:
• `` -> `...1`
lm_preds
# A tibble: 663 × 10
     ...1   Age Gender `Education Level` `Job Title`       `Years of Experience`
    <dbl> <dbl> <chr>              <dbl> <chr>                             <dbl>
 1 28938.    24 Male                   0 Sales Associate                       1
 2 28926.    28 Female                 0 Sales Associate                       1
 3 28946.    30 Female                 0 Sales Associate                       1
 4 25096.    21 Female                 0 Sales Representa…                     0
 5 24219.    21 Female                 0 Sales Representa…                     0
 6 25619.    24 Female                 0 Receptionist                          0
 7 25375.    24 Female                 0 Receptionist                          0
 8 24521.    24 Female                 0 Receptionist                          0
 9 25798.    24 Female                 0 Receptionist                          0
10 25226.    24 Female                 0 Receptionist                          0
# ℹ 653 more rows
# ℹ 4 more variables: Salary <dbl>, Country <chr>, Race <chr>, Senior <dbl>

While the predictions are not perfect, the model gets rather close for some employees, with some predictions landing within 1,000 dollars of the actual salary. Let’s tune the model to see if we can improve the accuracy.
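How close the predictions are can be summarised with a few numbers instead of comparing rows by eye. The sketch below uses simulated data and base R only; `toy`, `toy_train`, `toy_test`, and the `.pred` column are illustrative assumptions rather than objects from this project. The yardstick package loaded with tidymodels provides `rmse()` and `mae()` for the same quantities.

```r
# Simulated stand-in data; all object names here are illustrative.
set.seed(3)
n <- 300
toy <- data.frame(
  years = runif(n, 0, 20),
  educ  = sample(0:3, n, replace = TRUE)
)
toy$salary <- 30000 + 5000 * toy$years + 7000 * toy$educ +
  rnorm(n, sd = 15000)

# Simple 80/20 train/test split
train_idx <- sample(seq_len(n), size = round(0.8 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

toy_fit <- lm(salary ~ years + educ, data = toy_train)

# Store predictions in a named column rather than an auto-named one
toy_test$.pred <- predict(toy_fit, newdata = toy_test)

# Error summaries on the held-out rows
err  <- toy_test$salary - toy_test$.pred
rmse <- sqrt(mean(err^2))               # root mean squared error
mae  <- mean(abs(err))                  # mean absolute error
within_1000 <- mean(abs(err) <= 1000)   # share of predictions within $1,000

cat("RMSE:", round(rmse), " MAE:", round(mae),
    " within $1,000:", round(100 * within_1000, 1), "%\n")
```

Computing the same summaries for each candidate model on the same test set gives a direct way to tell whether a change actually improved accuracy.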

Let us try to optimize the model by running forward stepwise selection to see which variables it chooses.

ols_step_forward_p(fit)

                                       Stepwise Summary                                        
---------------------------------------------------------------------------------------------
Step    Variable                    AIC           SBC           SBIC         R2       Adj. R2 
---------------------------------------------------------------------------------------------
 0      Base Model               146619.170    146632.557    129691.086    0.00000    0.00000 
 1      `Years of Experience`    140211.084    140231.164    123283.645    0.65863    0.65858 
 2      `Education Level`        139565.367    139592.141    122637.451    0.69376    0.69366 
 3      Age                      139330.624    139364.092    122402.112    0.70568    0.70553 
 4      Gender                   139250.640    139290.801    122321.447    0.70970    0.70951 
 5      Senior                   139203.843    139250.697    122273.957    0.71207    0.71182 
 6      `Job Title`              136462.281    137011.147    119389.382    0.82269    0.82028 
---------------------------------------------------------------------------------------------

Final Model Output 
------------------

                              Model Summary                               
-------------------------------------------------------------------------
R                           0.907       RMSE                   22199.211 
R-Squared                   0.823       MSE                492804973.341 
Adj. R-Squared              0.820       Coef. Var                 19.363 
Pred R-Squared               -Inf       AIC                   136462.281 
MAE                     16411.937       SBC                   137011.147 
-------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares          DF         Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression    1.363699e+13          80    170462322393.804    341.204    0.0000 
Residual      2.939089e+12        5883       499590151.454                      
Total         1.657607e+13        5963                                          
--------------------------------------------------------------------------------

                                                       Parameter Estimates                                                         
----------------------------------------------------------------------------------------------------------------------------------
                                     model          Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
----------------------------------------------------------------------------------------------------------------------------------
                               (Intercept)     30724.678     11628.362                  2.642    0.008      7928.817    53520.539 
                     `Years of Experience`      5385.583       158.432        0.613    33.993    0.000      5074.999     5696.167 
                         `Education Level`      6742.715       601.793        0.113    11.204    0.000      5562.978     7922.451 
                                       Age        46.957       128.680        0.007     0.365    0.715      -205.302      299.217 
                                GenderMale       151.735       646.690        0.001     0.235    0.815     -1116.014     1419.484 
                                    Senior    -11349.575      1279.553       -0.075    -8.870    0.000    -13857.969    -8841.181 
                     `Job Title`Accountant     -2056.067     14429.640       -0.001    -0.142    0.887    -30343.462    26231.328 
       `Job Title`Administrative Assistant    -37698.529     19366.301       -0.013    -1.947    0.052    -75663.592      266.533 
             `Job Title`Back end Developer     28680.100     11290.229        0.101     2.540    0.011      6547.105    50813.095 
               `Job Title`Business Analyst      8674.405     12425.564        0.009     0.698    0.485    -15684.264    33033.074 
 `Job Title`Business Development Associate     -8307.031     14999.506       -0.005    -0.554    0.580    -37711.571    21097.510 
   `Job Title`Business Development Manager     25609.340     15819.178        0.013     1.619    0.106     -5402.059    56620.738 
    `Job Title`Business Operations Analyst    -12059.258     24996.671       -0.003    -0.482    0.630    -61061.915    36943.398 
      `Job Title`Content Marketing Manager     20797.860     11548.012        0.040     1.801    0.072     -1840.485    43436.205 
                     `Job Title`Copywriter     -9793.106     19363.178       -0.003    -0.506    0.613    -47752.048    28165.835 
       `Job Title`Customer Service Manager    -23494.967     19375.866       -0.008    -1.213    0.225    -61478.782    14488.847 
`Job Title`Customer Service Representative    -10538.486     14439.823       -0.006    -0.730    0.466    -38845.842    17768.870 
                   `Job Title`Data Analyst     53782.934     11251.495        0.238     4.780    0.000     31725.871    75839.997 
                  `Job Title`Data Engineer     27061.970     15854.905        0.013     1.707    0.088     -4019.467    58143.408 
                 `Job Title`Data Scientist     54071.181     11278.109        0.275     4.794    0.000     31961.946    76180.417 
                `Job Title`Delivery Driver     -4003.384     15820.630       -0.002    -0.253    0.800    -35017.629    27010.861 
      `Job Title`Digital Marketing Manager     13131.119     11661.456        0.022     1.126    0.260     -9729.619    35991.857 
   `Job Title`Digital Marketing Specialist      4762.372     12787.796        0.004     0.372    0.710    -20306.404    29831.149 
       `Job Title`Director of Data Science     60096.867     11693.643        0.104     5.139    0.000     37173.032    83020.703 
        `Job Title`Director of Engineering     25087.910     19401.468        0.009     1.293    0.196    -12946.094    63121.913 
            `Job Title`Director of Finance     20724.950     19389.933        0.007     1.069    0.285    -17286.440    58736.340 
                 `Job Title`Director of HR      8779.091     11570.108        0.017     0.759    0.448    -13902.570    31460.752 
    `Job Title`Director of Human Resources     19040.902     19407.615        0.007     0.981    0.327    -19005.153    57086.957 
          `Job Title`Director of Marketing     24719.012     11507.850        0.053     2.148    0.032      2159.399    47278.624 
         `Job Title`Director of Operations     12407.757     13282.415        0.010     0.934    0.350    -13630.656    38446.170 
                       `Job Title`Engineer      7728.827     19403.033        0.003     0.398    0.690    -30308.245    45765.898 
              `Job Title`Event Coordinator    -20850.977     24993.560       -0.005    -0.834    0.404    -69847.536    28145.581 
              `Job Title`Financial Advisor     15872.345     15006.482        0.009     1.058    0.290    -13545.871    45290.561 
              `Job Title`Financial Analyst     18540.649     11650.909        0.031     1.591    0.112     -4299.413    41380.711 
              `Job Title`Financial Manager     44411.121     11364.459        0.122     3.908    0.000     22132.608    66689.634 
            `Job Title`Front end Developer     22497.292     11288.028        0.079     1.993    0.046       368.610    44625.973 
            `Job Title`Front End Developer     18083.972     11908.358        0.024     1.519    0.129     -5260.783    41428.727 
            `Job Title`Full Stack Engineer     34505.899     11273.416        0.138     3.061    0.002     12405.862    56605.936 
               `Job Title`Graphic Designer       178.238     12302.457        0.000     0.014    0.988    -23939.096    24295.572 
                 `Job Title`HR Coordinator     -9212.579     12078.197       -0.011    -0.763    0.446    -32890.281    14465.122 
                  `Job Title`HR Generalist      -362.243     11414.284       -0.001    -0.032    0.975    -22738.432    22013.946 
                     `Job Title`HR Manager      7595.974     15822.284        0.004     0.480    0.631    -23421.515    38613.463 
    `Job Title`Human Resources Coordinator    -10482.010     11656.804       -0.018    -0.899    0.369    -33333.627    12369.607 
        `Job Title`Human Resources Manager     14995.523     11342.480        0.044     1.322    0.186     -7239.903    37230.950 
                  `Job Title`IT Consultant     26733.118     25010.862        0.007     1.069    0.285    -22297.360    75763.595 
          `Job Title`IT Support Specialist     -5371.244     19365.376       -0.002    -0.277    0.782    -43334.495    32592.007 
         `Job Title`Juniour HR Coordinator     -5260.711     19368.727       -0.002    -0.272    0.786    -43230.530    32709.109 
          `Job Title`Juniour HR Generalist       138.779     17081.810        0.000     0.008    0.994    -33347.844    33625.401 
                        `Job Title`Manager     23860.771     19401.799        0.008     1.230    0.219    -14173.881    61895.424 
              `Job Title`Marketing Analyst      7349.764     11351.852        0.020     0.647    0.517    -14904.035    29603.563 
          `Job Title`Marketing Coordinator      8465.381     11334.119        0.025     0.747    0.455    -13753.655    30684.417 
             `Job Title`Marketing Director     64246.570     11568.634        0.125     5.554    0.000     41567.798    86925.342 
              `Job Title`Marketing Manager     17239.131     11273.753        0.069     1.529    0.126     -4861.567    39339.828 
           `Job Title`Marketing Specialist      4646.590     13225.861        0.004     0.351    0.725    -21280.955    30574.135 
             `Job Title`Operations Analyst     -4802.359     13688.903       -0.003    -0.351    0.726    -31637.637    22032.918 
         `Job Title`Operations Coordinator     19471.707     17081.719        0.008     1.140    0.254    -14014.737    52958.151 
             `Job Title`Operations Manager     13632.820     11390.393        0.034     1.197    0.231     -8696.535    35962.174 
               `Job Title`Product Designer      5587.308     11488.573        0.012     0.486    0.627    -16934.515    28109.130 
                `Job Title`Product Manager     56333.937     11267.370        0.230     5.000    0.000     34245.753    78422.122 
      `Job Title`Product Marketing Manager     27285.898     11562.081        0.054     2.360    0.018      4619.973    49951.823 
            `Job Title`Project Coordinator      5592.006     15021.791        0.003     0.372    0.710    -23856.222    35040.235 
               `Job Title`Project Engineer     51934.458     11301.998        0.209     4.595    0.000     29778.390    74090.525 
                `Job Title`Project Manager     25094.367     11837.029        0.035     2.120    0.034      1889.443    48299.292 
                   `Job Title`Receptionist     -6833.589     11617.930       -0.012    -0.588    0.556    -29608.998    15941.821 
                      `Job Title`Recruiter    -22051.374     17078.282       -0.009    -1.291    0.197    -55531.080    11428.333 
              `Job Title`Research Director     50413.889     11594.862        0.101     4.348    0.000     27683.701    73144.078 
             `Job Title`Research Scientist     41876.405     11458.996        0.102     3.654    0.000     19412.563    64340.246 
                `Job Title`Sales Associate     -8782.814     11303.392       -0.030    -0.777    0.437    -30941.614    13375.986 
                 `Job Title`Sales Director     33455.013     11576.985        0.062     2.890    0.004     10759.869    56150.156 
                `Job Title`Sales Executive     -2945.057     11823.114       -0.004    -0.249    0.803    -26122.703    20232.589 
                  `Job Title`Sales Manager     21454.290     11587.854        0.039     1.851    0.064     -1262.161    44170.741 
           `Job Title`Sales Representative     -8004.790     11491.638       -0.016    -0.697    0.486    -30532.620    14523.041 
                      `Job Title`Scientist     16912.029     19402.909        0.006     0.872    0.383    -21124.799    54948.857 
           `Job Title`Social Media Manager      -911.428     12588.226       -0.001    -0.072    0.942    -25588.975    23766.119 
        `Job Title`Social Media Specialist    -10009.376     19362.830       -0.003    -0.517    0.605    -47967.635    27948.883 
             `Job Title`Software Developer      2634.802     11318.323        0.008     0.233    0.816    -19553.268    24822.872 
              `Job Title`Software Engineer     45161.545     11211.635        0.281     4.028    0.000     23182.621    67140.468 
      `Job Title`Software Engineer Manager     36299.231     11303.871        0.160     3.211    0.001     14139.491    58458.971 
            `Job Title`Training Specialist    -19145.762     19364.884       -0.007    -0.989    0.323    -57108.047    18816.524 
                    `Job Title`UX Designer     20642.232     17100.579        0.009     1.207    0.227    -12881.185    54165.648 
                  `Job Title`Web Developer      -478.455     11372.605       -0.001    -0.042    0.966    -22772.938    21816.028 
----------------------------------------------------------------------------------------------------------------------------------

Unsurprisingly, the selection picked the variables we highlighted in the original model. Interestingly, it drops Race and Country; it does not consider them strong enough to improve the model. Gender is still added at step 4, although its coefficient is far from significant in the final model (p = 0.815).
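For comparison, forward selection can also be sketched with base R's `step()`. Note that this is a different criterion from the one used above: `step()` adds terms by AIC, whereas `ols_step_forward_p()` enters predictors based on p-values, so the two need not agree. The data and the names `toy`, `null_fit`, and `fwd` are illustrative assumptions.

```r
# Simulated data; `country` has no effect on salary by construction.
set.seed(4)
n <- 300
toy <- data.frame(
  years   = runif(n, 0, 20),
  educ    = sample(0:3, n, replace = TRUE),
  country = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$salary <- 30000 + 5000 * toy$years + 7000 * toy$educ +
  rnorm(n, sd = 15000)

# Start from the intercept-only model and let step() add terms by AIC
null_fit <- lm(salary ~ 1, data = toy)
fwd <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ years + educ + country),
  direction = "forward",
  trace     = 0                          # suppress the step-by-step printout
)

# Terms in the selected model, and the order in which they were added
selected <- attr(terms(fwd), "term.labels")
print(selected)
print(fwd$anova)
```

Running more than one selection criterion is a cheap way to see whether a borderline variable is kept consistently.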

Now that we’ve identified the most useful variables, let’s fit a new model on Age, Education Level, Years of Experience, Senior, and Job Title, leaving out Gender, Race, and Country, to see if removing unnecessary predictors improves the accuracy:

step_fit <- 
    lm(Salary ~ Age + `Education Level` +
           `Years of Experience` + Senior + 
           `Job Title`, data = training)
summary(step_fit)

Call:
lm(formula = Salary ~ Age + `Education Level` + `Years of Experience` + 
    Senior + `Job Title`, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-128339  -11606     -19   11049   64618 

Coefficients:
                                            Estimate Std. Error t value
(Intercept)                                 30716.00   11627.37   2.642
Age                                            50.23     127.91   0.393
`Education Level`                            6732.57     600.19  11.217
`Years of Experience`                        5383.39     158.14  34.041
Senior                                     -11342.17    1279.06  -8.868
`Job Title`Accountant                       -2030.78   14428.08  -0.141
`Job Title`Administrative Assistant        -37782.60   19361.43  -1.951
`Job Title`Back end Developer               28725.88   11287.64   2.545
`Job Title`Business Analyst                  8702.63   12423.98   0.700
`Job Title`Business Development Associate   -8286.94   14998.06  -0.553
`Job Title`Business Development Manager     25678.88   15815.13   1.624
`Job Title`Business Operations Analyst     -11980.22   24992.39  -0.479
`Job Title`Content Marketing Manager        20767.82   11546.37   1.799
`Job Title`Copywriter                       -9863.62   19359.29  -0.510
`Job Title`Customer Service Manager        -23509.65   19374.21  -1.213
`Job Title`Customer Service Representative -10610.91   14435.36  -0.735
`Job Title`Data Analyst                     53822.85   11249.31   4.785
`Job Title`Data Engineer                    27059.08   15853.63   1.707
`Job Title`Data Scientist                   54103.62   11276.36   4.798
`Job Title`Delivery Driver                  -3921.43   15815.50  -0.248
`Job Title`Digital Marketing Manager        13153.35   11660.14   1.128
`Job Title`Digital Marketing Specialist      4693.30   12783.38   0.367
`Job Title`Director of Data Science         60109.13   11692.59   5.141
`Job Title`Director of Engineering          25165.97   19397.06   1.297
`Job Title`Director of Finance              20715.00   19388.33   1.068
`Job Title`Director of HR                    8795.82   11568.96   0.760
`Job Title`Director of Human Resources      18956.35   19402.71   0.977
`Job Title`Director of Marketing            24725.47   11506.89   2.149
`Job Title`Director of Operations           12403.58   13281.34   0.934
`Job Title`Engineer                          7809.96   19398.39   0.403
`Job Title`Event Coordinator               -20922.56   24989.69  -0.837
`Job Title`Financial Advisor                15829.35   15004.16   1.055
`Job Title`Financial Analyst                18598.19   11647.39   1.597
`Job Title`Financial Manager                44377.13   11362.62   3.906
`Job Title`Front end Developer              22518.50   11286.76   1.995
`Job Title`Front End Developer              18169.17   11901.86   1.527
`Job Title`Full Stack Engineer              34510.24   11272.50   3.061
`Job Title`Graphic Designer                   136.76   12300.20   0.011
`Job Title`HR Coordinator                   -9218.07   12077.20  -0.763
`Job Title`HR Generalist                     -363.42   11413.37  -0.032
`Job Title`HR Manager                        7507.96   15816.57   0.475
`Job Title`Human Resources Coordinator     -10540.04   11653.24  -0.904
`Job Title`Human Resources Manager          14942.51   11339.32   1.318
`Job Title`IT Consultant                    26801.88   25007.14   1.072
`Job Title`IT Support Specialist            -5312.17   19362.19  -0.274
`Job Title`Juniour HR Coordinator           -5329.93   19364.92  -0.275
`Job Title`Juniour HR Generalist              110.88   17080.02   0.006
`Job Title`Manager                          23933.76   19397.75   1.234
`Job Title`Marketing Analyst                 7369.47   11350.63   0.649
`Job Title`Marketing Coordinator             8399.05   11329.68   0.741
`Job Title`Marketing Director               64299.46   11565.51   5.560
`Job Title`Marketing Manager                17219.58   11272.54   1.528
`Job Title`Marketing Specialist              4657.90   13224.71   0.352
`Job Title`Operations Analyst               -4767.97   13687.02  -0.348
`Job Title`Operations Coordinator           19438.15   17079.75   1.138
`Job Title`Operations Manager               13701.25   11385.74   1.203
`Job Title`Product Designer                  5665.83   11482.78   0.493
`Job Title`Product Manager                  56370.65   11265.38   5.004
`Job Title`Product Marketing Manager        27274.54   11561.05   2.359
`Job Title`Project Coordinator               5564.46   15020.13   0.370
`Job Title`Project Engineer                 51955.49   11300.74   4.598
`Job Title`Project Manager                  25138.45   11834.59   2.124
`Job Title`Receptionist                     -6902.11   11613.33  -0.594
`Job Title`Recruiter                       -22129.10   17073.70  -1.296
`Job Title`Research Director                50460.95   11592.20   4.353
`Job Title`Research Scientist               41874.13   11458.07   3.655
`Job Title`Sales Associate                  -8786.43   11302.47  -0.777
`Job Title`Sales Director                   33529.17   11571.74   2.898
`Job Title`Sales Executive                  -2961.62   11821.95  -0.251
`Job Title`Sales Manager                    21486.89   11586.09   1.855
`Job Title`Sales Representative             -7988.43   11490.50  -0.695
`Job Title`Scientist                        16994.22   19398.19   0.876
`Job Title`Social Media Manager              -973.72   12584.42  -0.077
`Job Title`Social Media Specialist         -10080.43   19358.91  -0.521
`Job Title`Software Developer                2644.85   11317.33   0.234
`Job Title`Software Engineer                45185.67   11210.26   4.031
`Job Title`Software Engineer Manager        36333.11   11302.04   3.215
`Job Title`Training Specialist             -19228.44   19360.12  -0.993
`Job Title`UX Designer                      20565.25   17096.06   1.203
`Job Title`Web Developer                     -467.46   11371.60  -0.041
                                           Pr(>|t|)    
(Intercept)                                 0.00827 ** 
Age                                         0.69459    
`Education Level`                           < 2e-16 ***
`Years of Experience`                       < 2e-16 ***
Senior                                      < 2e-16 ***
`Job Title`Accountant                       0.88807    
`Job Title`Administrative Assistant         0.05105 .  
`Job Title`Back end Developer               0.01096 *  
`Job Title`Business Analyst                 0.48366    
`Job Title`Business Development Associate   0.58060    
`Job Title`Business Development Manager     0.10450    
`Job Title`Business Operations Analyst      0.63170    
`Job Title`Content Marketing Manager        0.07213 .  
`Job Title`Copywriter                       0.61042    
`Job Title`Customer Service Manager         0.22501    
`Job Title`Customer Service Representative  0.46233    
`Job Title`Data Analyst                    1.76e-06 ***
`Job Title`Data Engineer                    0.08791 .  
`Job Title`Data Scientist                  1.64e-06 ***
`Job Title`Delivery Driver                  0.80418    
`Job Title`Digital Marketing Manager        0.25934    
`Job Title`Digital Marketing Specialist     0.71353    
`Job Title`Director of Data Science        2.82e-07 ***
`Job Title`Director of Engineering          0.19454    
`Job Title`Director of Finance              0.28537    
`Job Title`Director of HR                   0.44711    
`Job Title`Director of Human Resources      0.32861    
`Job Title`Director of Marketing            0.03169 *  
`Job Title`Director of Operations           0.35039    
`Job Title`Engineer                         0.68725    
`Job Title`Event Coordinator                0.40249    
`Job Title`Financial Advisor                0.29147    
`Job Title`Financial Analyst                0.11037    
`Job Title`Financial Manager               9.51e-05 ***
`Job Title`Front end Developer              0.04608 *  
`Job Title`Front End Developer              0.12692    
`Job Title`Full Stack Engineer              0.00221 ** 
`Job Title`Graphic Designer                 0.99113    
`Job Title`HR Coordinator                   0.44534    
`Job Title`HR Generalist                    0.97460    
`Job Title`HR Manager                       0.63503    
`Job Title`Human Resources Coordinator      0.36578    
`Job Title`Human Resources Manager          0.18763    
`Job Title`IT Consultant                    0.28387    
`Job Title`IT Support Specialist            0.78382    
`Job Title`Juniour HR Coordinator           0.78314    
`Job Title`Juniour HR Generalist            0.99482    
`Job Title`Manager                          0.21731    
`Job Title`Marketing Analyst                0.51620    
`Job Title`Marketing Coordinator            0.45852    
`Job Title`Marketing Director              2.82e-08 ***
`Job Title`Marketing Manager                0.12667    
`Job Title`Marketing Specialist             0.72469    
`Job Title`Operations Analyst               0.72758    
`Job Title`Operations Coordinator           0.25513    
`Job Title`Operations Manager               0.22888    
`Job Title`Product Designer                 0.62173    
`Job Title`Product Manager                 5.78e-07 ***
`Job Title`Product Marketing Manager        0.01835 *  
`Job Title`Project Coordinator              0.71105    
`Job Title`Project Engineer                4.36e-06 ***
`Job Title`Project Manager                  0.03370 *  
`Job Title`Receptionist                     0.55232    
`Job Title`Recruiter                        0.19499    
`Job Title`Research Director               1.37e-05 ***
`Job Title`Research Scientist               0.00026 ***
`Job Title`Sales Associate                  0.43696    
`Job Title`Sales Director                   0.00378 ** 
`Job Title`Sales Executive                  0.80219    
`Job Title`Sales Manager                    0.06371 .  
`Job Title`Sales Representative             0.48694    
`Job Title`Scientist                        0.38103    
`Job Title`Social Media Manager             0.93833    
`Job Title`Social Media Specialist          0.60259    
`Job Title`Software Developer               0.81523    
`Job Title`Software Engineer               5.63e-05 ***
`Job Title`Software Engineer Manager        0.00131 ** 
`Job Title`Training Specialist              0.32065    
`Job Title`UX Designer                      0.22905    
`Job Title`Web Developer                    0.96721    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22350 on 5884 degrees of freedom
Multiple R-squared:  0.8227,    Adjusted R-squared:  0.8203 
F-statistic: 345.6 on 79 and 5884 DF,  p-value: < 2.2e-16

Unfortunately, R-squared remained nearly the same after tuning. The tune did, however, slightly reduce the residual standard error and increase the F-statistic, so the model is marginally stronger than our initial attempt, even if the improvement is not obvious at first glance.

Now that we’ve tuned our model, let us visualize the predictions.

This first plot depicts the model's predicted Salary across the range of Age, with the other predictors held at fixed values.

effect_plot(step_fit, pred = Age)

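As we can see, the plot shows a clear linear trend between predicted salary and age. Note, however, that the Age coefficient in the model summary above is not statistically significant (p ≈ 0.69), so this trend should be read with caution once the other predictors are accounted for.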

This second plot again depicts salary vs. age, but this time it overlays the partial residuals and shows the confidence interval around the fitted line. We can see most of our residuals lie within the interval, although there are a few outliers at both ends.

effect_plot(step_fit, pred = Age, interval = TRUE, partial.residuals = TRUE)

Tree Methods:

For our second model, we want to use tree methods to see whether they give a better answer to our main question than multiple linear regression. We will focus mainly on the Decision Tree method, but we will also fit a Random Forest model for comparison.

Decision Tree:

As we learned in class, decision trees can mirror human decision-making more closely than other methods. We want to put this to the test by building a decision tree model on our salary data to see whether it can accurately predict an employee’s salary using binary decision-making.

To begin, we will fit a decision tree the non-tidy way (calling rpart directly) to visualize the decision-making process the model uses to determine salary.

# Non-tidy way (for visualization purposes)
tree_fit <- rpart(Salary ~., data = train)

rpart.plot(tree_fit)

We can see from this output that Years of Experience and Job Title are very influential in the tree's decisions. To print this tree without dozens of job names crowding out the split conditions, we recoded the job title as a numeric value and converted it to a factor. This makes the tree a bit harder to read, but the lower a job title’s value is, the less money the position makes. Looking at the tree itself, employees with more experience earn more, and on that side of the tree the position they hold barely matters: their experience alone places them in a higher-salary branch. However, when we go down the tree in the opposite direction (meaning an employee has less experience), their position starts to play a more pivotal role.

Below we can see the decision tree model fitted onto the testing data. We can also see the predicted values compared to the actual salary values.

tree_fit_2 <- rpart(Salary ~., data = training)
tree_preds <- predict(tree_fit_2, newdata = testing) %>%
  bind_cols(testing)
New names:
• `` -> `...1`
tree_preds
# A tibble: 663 × 10
     ...1   Age Gender `Education Level` `Job Title`       `Years of Experience`
    <dbl> <dbl> <chr>              <dbl> <chr>                             <dbl>
 1 37293.    24 Male                   0 Sales Associate                       1
 2 37293.    28 Female                 0 Sales Associate                       1
 3 37293.    30 Female                 0 Sales Associate                       1
 4 37293.    21 Female                 0 Sales Representa…                     0
 5 37293.    21 Female                 0 Sales Representa…                     0
 6 37293.    24 Female                 0 Receptionist                          0
 7 37293.    24 Female                 0 Receptionist                          0
 8 37293.    24 Female                 0 Receptionist                          0
 9 37293.    24 Female                 0 Receptionist                          0
10 37293.    24 Female                 0 Receptionist                          0
# ℹ 653 more rows
# ℹ 4 more variables: Salary <dbl>, Country <chr>, Race <chr>, Senior <dbl>

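While the predictions appear to be in a reasonable range, the model assigns the same value (about 37,293) to all ten records shown, because they all fall into the same leaf of the tree. In other words, the model is not good at distinguishing small differences between similar records. Therefore, we need to tune it so that it factors these smaller differences into its predictions.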
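To illustrate why an untuned regression tree returns identical predictions for similar records, here is a minimal sketch on simulated data (the `toy` data frame and its columns are hypothetical, not our salary data). A regression tree predicts one value per leaf, so the number of distinct predictions can never exceed the number of leaves.

```r
library(rpart)

set.seed(7)

# Hypothetical toy data: salary grows with experience, plus noise
toy <- data.frame(experience = runif(300, 0, 30))
toy$salary <- 30000 + 5000 * toy$experience + rnorm(300, sd = 8000)

# Default rpart settings, as in the non-tidy fit above
toy_tree <- rpart(salary ~ experience, data = toy)

# Count terminal nodes (leaves) in the fitted tree
n_leaves <- sum(toy_tree$frame$var == "<leaf>")

# Each observation receives the mean salary of the leaf it falls into
preds <- predict(toy_tree, newdata = toy)
n_distinct_preds <- length(unique(preds))

cat("leaves:", n_leaves, " distinct predictions:", n_distinct_preds, "\n")
```

Allowing a deeper tree or a smaller cost-complexity penalty creates more leaves, which is what lets a tuned tree separate otherwise similar records.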

# Tidy way + tuning   
tree_model <- decision_tree(mode = "regression",
                            cost_complexity = tune(),
                            tree_depth = tune()) %>%
  set_engine("rpart")


data_recipe <- recipe(Salary ~., training)

wf <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(tree_model)

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)

cv_samples <- vfold_cv(training)

tree_tune <- wf %>%
  tune_grid(
    resamples = cv_samples,
    grid = tree_grid
  )

best_tree <- tree_tune %>%
  select_best(metric = "rmse")

final_wf <- wf %>%
  finalize_workflow(best_tree)
  

final_wf %>%
  last_fit(salary_split) %>%
  collect_metrics() 
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard   10533.    Preprocessor1_Model1
2 rsq     standard       0.960 Preprocessor1_Model1
tuned_tree_preds <- final_wf %>%
  last_fit(salary_split) %>%
  collect_predictions() %>%
  bind_cols(testing)
New names:
• `Salary` -> `Salary...4`
• `Salary` -> `Salary...11`

As we can see from the output of the tuned decision tree above, we get an R-squared value of 0.960, which is impressive considering that single decision trees often suffer from low predictive power. However, our RMSE is still about 10,533, so a typical prediction error is on the order of ten thousand. Large misses on outlying salaries weigh heavily on this metric, which is a common weakness of decision trees.
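For readers unfamiliar with these two metrics, here is a small base R sketch of how they can be computed by hand. The `truth` and `pred` vectors are made-up illustration values, not our results. The R-squared shown is the squared correlation between truth and prediction, which is how yardstick's `rsq` metric is defined.

```r
# Hypothetical actual and predicted salaries (illustration only)
truth <- c(25000, 25000, 60000, 90000, 120000)
pred  <- c(30667, 26310, 58000, 95000, 110000)

# RMSE: square root of the mean squared prediction error.
# Squaring makes large misses count disproportionately.
rmse_val <- sqrt(mean((truth - pred)^2))

# R-squared as the squared correlation between truth and prediction
rsq_val <- cor(truth, pred)^2

cat("RMSE:", round(rmse_val, 1), "\n")
cat("R-squared:", round(rsq_val, 3), "\n")
```

Because RMSE is expressed in the units of the response, it should be judged against the scale of salaries rather than in isolation.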

tuned_tree_preds
# A tibble: 663 × 14
    .pred id              .row Salary...4 .config   Age Gender `Education Level`
    <dbl> <chr>          <int>      <dbl> <chr>   <dbl> <chr>              <dbl>
 1 30667. train/test sp…    15      25000 Prepro…    24 Male                   0
 2 26310. train/test sp…    22      25000 Prepro…    28 Female                 0
 3 26310. train/test sp…    42      25000 Prepro…    30 Female                 0
 4 25136. train/test sp…    45      25000 Prepro…    21 Female                 0
 5 25136. train/test sp…    56      25000 Prepro…    21 Female                 0
 6 25136. train/test sp…    99      25000 Prepro…    24 Female                 0
 7 25136. train/test sp…   107      25000 Prepro…    24 Female                 0
 8 25136. train/test sp…   111      25000 Prepro…    24 Female                 0
 9 25136. train/test sp…   126      25000 Prepro…    24 Female                 0
10 25136. train/test sp…   128      25000 Prepro…    24 Female                 0
# ℹ 653 more rows
# ℹ 6 more variables: `Job Title` <chr>, `Years of Experience` <dbl>,
#   Salary...11 <dbl>, Country <chr>, Race <chr>, Senior <dbl>

Looking at our predicted values now, we can see that the model is much better at capturing slight differences between similar employees: the first ten predictions now range from about 25,136 to 30,667 against an actual salary of 25,000. Overall, this tuned regression decision tree does a very good job of making accurate predictions.

Random Forest Tree:

For comparison, let us look at a Random Forest model.

rf_model <- rand_forest() %>% 
    set_engine("ranger") %>% 
    set_mode("regression")

# workflow
rf_wf <- workflow() %>% 
    add_model(rf_model) %>% 
    add_recipe(data_recipe)

# fit the random forest
rf_fit <- rf_wf %>% fit(training)

# predict
testing$pred <- predict(rf_fit, testing)$.pred

# metrics
testing %>% metrics(Salary, pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   11885.   
2 rsq     standard       0.952
3 mae     standard    7964.   

The untuned Random Forest did slightly worse than the tuned decision tree model (RMSE of about 11,885 vs. 10,533, and R-squared of 0.952 vs. 0.960), but it is still very accurate at predicting salary.

Ridge Regression:

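We chose ridge regression as our final model in the hope that shrinking the coefficient estimates would reduce the variance of the model and give us an even more accurate model than our tuned Decision Tree.

Let us start with a ridge model whose penalty we assign manually. The model is specified with a penalty of 0.1, and we then inspect the coefficient estimates at a penalty of 4. Ridge regression also calls for centering and scaling the predictors so that the penalty treats them equally. Note that the selector used in the recipe below, all_nominal_predictors(), matches only factor and character columns, so numeric predictors would need all_numeric_predictors() to be standardized by these steps.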

ridge_recipe <- recipe(Salary ~ ., data = train) %>%
  step_center(all_nominal_predictors()) %>%
  step_scale(all_nominal_predictors())


ridge_model <- linear_reg(mixture = 0, penalty = .1) %>%
  set_engine("glmnet")

ridge_wf <- workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_model) %>%
  fit(train)
extract_fit_parsnip(ridge_wf) %>% tidy(penalty = 4)
# A tibble: 9 × 3
  term                estimate penalty
  <chr>                  <dbl>   <dbl>
1 (Intercept)          38827.        4
2 Age                    293.        4
3 Gender                5830.        4
4 Education Level      15021.        4
5 Job Title             -114.        4
6 Years of Experience   5034.        4
7 Country               -345.        4
8 Race                   -71.2       4
9 Senior               -4810.        4

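From this output alone we cannot judge how accurate the model is, since coefficient sizes do not measure predictive error. What the table does show is that the estimates remain extremely far away from zero, suggesting that a penalty of 4 applies very little shrinkage relative to the scale of salary.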
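To illustrate how the ridge penalty affects coefficient estimates, here is a minimal base R sketch on simulated data (not the salary data). It uses the textbook closed-form ridge solution on standardized predictors. The helper `ridge_coef` is a hypothetical name introduced for this illustration, and glmnet scales its penalty differently, so the lambda values here are not comparable to the ones in our model.

```r
set.seed(42)
n <- 100

# Simulated, standardized predictors (illustration only)
X <- scale(cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)))

# Response depends on x1 and x2; center it so no intercept is needed
y <- 5 * X[, "x1"] - 3 * X[, "x2"] + rnorm(n)
y <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

b_small <- ridge_coef(X, y, lambda = 0.1)   # almost no shrinkage
b_large <- ridge_coef(X, y, lambda = 1000)  # heavy shrinkage toward zero

print(round(rbind(b_small, b_large), 3))
```

With a small penalty the estimates stay close to the least-squares fit. Only a penalty that is large relative to the scale of the data pulls them noticeably toward zero.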

Now, let us try tuning the model to see if we can improve the accuracy of the ridge regression.

## TUNING
folds <-vfold_cv(train)

model <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_engine("glmnet")

tuned_wf <- workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_model)

ridge_grid <- grid_regular(mixture(), penalty(), levels = 10)

tuned_grid <- tune_grid(tuned_wf, resamples = folds, grid = ridge_grid)
Warning: No tuning parameters have been detected, performance will be evaluated
using the resamples with no tuning. Did you want to [tune()] parameters?
tuned_grid %>% collect_metrics() %>% filter(.metric == "rmse") %>% arrange(mean)
# A tibble: 1 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 rmse    standard   29001.    10    359. Preprocessor1_Model1

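The cross-validated RMSE of about 29,001 is almost three times larger than that of our decision tree model, so this model is far less accurate at predicting salary. Note also the warning in the output: the workflow was built with ridge_model, which has a fixed penalty of 0.1, rather than the tunable model defined just above it. As a result no penalty values were actually compared, and this RMSE reflects the fixed-penalty model only.

Before we make any assumptions, let us take a look at the predictions.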
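The warning printed above says that no tuning parameters were detected in the workflow. As a hedged sketch of a setup in which the penalty is actually tuned, the example below uses simulated data and hypothetical object names (`toy`, `toy_recipe`, `tunable_ridge`, `toy_wf`). It assumes the tidymodels and glmnet packages are installed and is not a rerun of our analysis.

```r
library(tidymodels)

set.seed(1)

# Hypothetical toy data standing in for a training set
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200, sd = 5),
  x3 = rnorm(200),
  y  = 3 * x1 + 0.5 * x2 + rnorm(200)
)

# Center and scale the numeric predictors
toy_recipe <- recipe(y ~ ., data = toy) %>%
  step_normalize(all_numeric_predictors())

# mixture = 0 gives ridge; the penalty is marked for tuning
tunable_ridge <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_engine("glmnet")

# The workflow must contain the model spec that carries tune()
toy_wf <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(tunable_ridge)

# Only the penalty is tuned, so the grid contains only penalty()
penalty_grid <- grid_regular(penalty(), levels = 10)

toy_tuned <- tune_grid(
  toy_wf,
  resamples = vfold_cv(toy, v = 5),
  grid = penalty_grid
)

best_penalty <- select_best(toy_tuned, metric = "rmse")
best_penalty
```

The key differences from a fixed-penalty workflow are that the model passed to add_model() contains tune(), and that the grid lists only the parameters being tuned.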

tuned_grid %>% 
    select_best() %>% 
    finalize_workflow(tuned_wf, .) %>% 
    last_fit(ridge_split) %>% 
    collect_predictions()
Warning in select_best(.): No value of `metric` was given; "rmse" will be used.
# A tibble: 663 × 5
    .pred id                .row Salary .config             
    <dbl> <chr>            <int>  <dbl> <chr>               
 1 53233. train/test split    10  25000 Preprocessor1_Model1
 2 48988. train/test split    12  25000 Preprocessor1_Model1
 3 50212. train/test split    33  25000 Preprocessor1_Model1
 4 50489. train/test split    34  25000 Preprocessor1_Model1
 5 50557. train/test split    35  25000 Preprocessor1_Model1
 6 42141. train/test split    55  25000 Preprocessor1_Model1
 7 42711. train/test split    59  25000 Preprocessor1_Model1
 8 43567. train/test split    69  25000 Preprocessor1_Model1
 9 43421. train/test split    71  25000 Preprocessor1_Model1
10 43027. train/test split    75  25000 Preprocessor1_Model1
# ℹ 653 more rows

Our ridge regression model overestimates salary for every one of the ten employees shown, predicting roughly 42,000 to 53,000 where the actual salary is 25,000. It is now safe to say that this model is the least accurate out of the three that we have created today.

Comparison:

To compare our models, our decision tree by far did the best, as we have previously stated, but our multiple linear regression model was still respectable, explaining about 82% of the variance in salary. Now for the ridge regression model. Our ridge regression model was not accurate, although the warning in the tuning step shows that its penalty was never actually tuned, so this result should be treated with caution. We are led to believe that the poor fit may also have been due to the extremely large variance within the dataset.

Conclusion:

In conclusion, we were able to answer all of our questions after analyzing and modeling the data.

Starting with our minor questions:

  • Women do, in fact, get paid less than men on average. Although some men hold lower-paying jobs than women, a woman’s job is, on average, likely to pay less than a man’s.

  • Age does play a large role in how much an employee earns. Experience and age go hand in hand, as you gain experience as you age (unless you are unemployed or start work later than the average person). Still, being older in your field almost certainly leads to better pay. We did find, however, that 60-year-olds make about the same as 50-year-olds do on average, so do not anticipate a pay raise heading into your pre-retirement years.

  • Having a senior-level position does indeed lead to a pay increase on average, and while we found a handful of outliers under 30, most employees in a senior-level position were older than this mark.

  • Having a higher level of education does lead to a higher salary, and quite significantly so. We would hope this would be the case considering the amount of time and resources it takes to get each higher level of education.

  • No, you do not need to move to another country to get a better wage. While there may be other reasons (such as benefits) to entice you to move abroad, salary should not be one of them.

To finish off this project, let us answer our main question: Can we accurately predict the salary of an employee given the predictors from the dataset?

The answer to this question is yes. Using a tuned decision tree model, we achieved an R-squared of about 0.96 on our testing data. The model is not perfect, but this is a strong result for a regression problem, where predicting a continuous value closely is hard to do.

To say the accuracy of our decision tree was a surprise would be an understatement. Considering the relatively small number of variables in the data set, we thought we would not be able to accurately predict salary, so creating such an accurate model was a pleasant surprise for us.