library(tidyverse)
library(skimr)      #skim
library(gdata)      #trim
library(ggpubr)     #ggarrange
library(scales)     #formatting numbers in ggplot axis 
library(cowplot)    #density chart
library(outliers)   #scores
library(qdapTools)  #mtabulate
library(ggcorrplot)   #corrplot for categorical data
library(correlationfunnel)
library(usmap)
library(caret) #the all in one model fitting package

Introduction

Our goal is to use job salary data (retrieved from the Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data (Federal Reserve Economic Data: https://fred.stlouisfed.org/release/tables?eid=257197&rid=110) to predict whether the salaries in the survey end up above or below the per capita personal income for the respondent's state. As a real-world survey dataset, the data is messy, so this will take a fair amount of transformation and cleaning. After cleaning and preparation, we want to build classification models: a logit model, an SVM model, and a neural network model. The end goal is a model that best predicts whether a person's compensation falls above or below their state's per capita personal income given their qualifications. Workers could use such a model to see which factors contribute to being paid above or below that benchmark, and businesses could use it to gauge whether a candidate's qualifications warrant pay above or below it.

Note that the data processing steps were a shared effort with another CUNY MSDS student, Diana Plunkett; they were originally developed for a team project, and here I would like to try different models with this data.
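As a preview of the modeling stage, below is a minimal caret sketch of the three planned model types. It assumes a fully prepared data frame (here called model_df) containing the target hi_comp; the method names are illustrative, and the actual resampling, preprocessing, and tuning choices are made later in the analysis.

#sketch only: model_df is the prepared modeling data frame with target hi_comp
#svmRadial and nnet additionally require the kernlab and nnet packages
ctrl <- trainControl(method = "cv", number = 5)
fit_logit <- train(hi_comp ~ ., data = model_df, method = "glm",
                   family = "binomial", trControl = ctrl)
fit_svm   <- train(hi_comp ~ ., data = model_df, method = "svmRadial",
                   preProcess = c("center", "scale"), trControl = ctrl)
fit_nnet  <- train(hi_comp ~ ., data = model_df, method = "nnet",
                   trace = FALSE, trControl = ctrl)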

Loading Data

Here we load:

  • the Salary Survey data
  • the Personal Income Per Capita by State
#Loading Data
sal_df <- as.data.frame(read.csv('Ask A Manager Salary Survey 2021 (Responses).csv'))
#load in the per capita income data.
state_inc<-as.data.frame(read.csv('PersonalIncomePerCapita.csv'))

Column Names

Before we do anything else, we want to rename the columns in the Salary Survey Data. As is typical of survey data, the column names are the full text of the questions, and as such are long and unwieldy. It would be cumbersome to do anything in this form.

names(sal_df)<-c('timestamp','age','industry','job_title', 'title_context','annual_sal','add_comp','ccy','other_ccy','income_context','ctry','state','city','yrs_wrk_exp','yrs_in_field_exp','edu','gender', 'race')

names(state_inc)<-c('state','PIncPerCap')

Description of Variables

Below is a table showing the variables in the Salary Survey data set:

Variable    Survey Question
timestamp Timestamp on the survey response
age How old are you?
industry What industry do you work in?
job_title Job title
title_context If your job title needs additional context, please clarify here
annual_sal What is your annual salary? (You’ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent – what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
add_comp How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
ccy Please indicate the currency
other_ccy If “Other,” please indicate the currency here
income_context If your income needs additional context, please provide it here
ctry What country do you work in?
state If you’re in the U.S., what state do you work in?
city What city do you work in?
yrs_wrk_exp How many years of professional work experience do you have overall?
yrs_in_field_exp How many years of professional work experience do you have in your field?
edu What is your highest level of education completed?
gender What is your gender?
race What is your race? (Choose all that apply.)

And a table showing the variables in the Personal Income Per Capita by State:

Variable    Description
state The name of the US State
PIncPerCap The personal income per capita for that state, for 2022, per FRED (Federal Reserve Economic Data)

Data Exploration

Let’s take a glimpse at the Salary Survey Data. There are 27,992 rows and 18 columns.

### Data Preprocessing 
glimpse(sal_df)
## Rows: 27,992
## Columns: 18
## $ timestamp        <chr> "4/27/2021 11:02:09", "4/27/2021 11:02:21", "4/27/202…
## $ age              <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34",…
## $ industry         <chr> "Education (Higher Education)", "Computing or Tech", …
## $ job_title        <chr> "Research and Instruction Librarian", "Change & Inter…
## $ title_context    <chr> "", "", "", "", "", "", "", "High school, FT", "Data …
## $ annual_sal       <int> 55000, 54600, 34000, 62000, 60000, 62000, 33000, 5000…
## $ add_comp         <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0…
## $ ccy              <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "USD…
## $ other_ccy        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ income_context   <chr> "", "", "", "", "", "", "", "", "", "I work for a Cha…
## $ ctry             <chr> "United States", "United Kingdom", "US", "USA", "US",…
## $ state            <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "South…
## $ city             <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "G…
## $ yrs_wrk_exp      <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 y…
## $ yrs_in_field_exp <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years",…
## $ edu              <chr> "Master's degree", "College degree", "College degree"…
## $ gender           <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "Ma…
## $ race             <chr> "White", "White", "White", "White", "White", "White",…

Data Preprocessing

Before beginning our exploration, we need to take some steps to prepare the data to be modeled. We will:

  • convert the empty strings to NA
  • make the category fields factors and order the values
  • limit to responses from the US
  • make sure state is in the form needed to merge with the personal income per capita from the state_inc data
  • merge the state_inc data
  • sum the annual_sal and the add_comp to get a total_comp
  • create a column that indicates if total_comp is above or below the state per capita income

Once these steps are complete, we can explore our data and do any data preparation that might be advantageous in our modeling.

Empty Strings to NA

Many columns have empty strings that should be NA.

#empty strings to NA
sal_df[sal_df == ''] <- NA

Factors and Order

These are the category fields to be made into factors: age, yrs_wrk_exp, yrs_in_field_exp, edu and gender.

#make factors
sal_df |>
  mutate(
    across(
      c(age, yrs_wrk_exp, yrs_in_field_exp, edu, gender),
      as_factor
    )
  )  -> sal_df

sal_df |>
  dplyr::select(where(is.factor)) |>
  sapply(levels)
## $age
## [1] "25-34"      "45-54"      "35-44"      "18-24"      "65 or over"
## [6] "55-64"      "under 18"  
## 
## $yrs_wrk_exp
## [1] "5-7 years"        "8 - 10 years"     "2 - 4 years"      "21 - 30 years"   
## [5] "11 - 20 years"    "1 year or less"   "41 years or more" "31 - 40 years"   
## 
## $yrs_in_field_exp
## [1] "5-7 years"        "2 - 4 years"      "21 - 30 years"    "11 - 20 years"   
## [5] "1 year or less"   "8 - 10 years"     "31 - 40 years"    "41 years or more"
## 
## $edu
## [1] "Master's degree"                    "College degree"                    
## [3] "PhD"                                "Some college"                      
## [5] "High School"                        "Professional degree (MD, JD, etc.)"
## 
## $gender
## [1] "Woman"                         "Non-binary"                   
## [3] "Man"                           "Other or prefer not to answer"
## [5] "Prefer not to answer"

Looking at these values, it is clear that all but gender can be ordered.

#order factors
sal_df |>
  mutate(
    age = ordered(age, 
                  c("under 18", 
                    "18-24", 
                    "25-34", 
                    "35-44", 
                    "45-54", 
                    "55-64", 
                    "65 or over")),
    yrs_wrk_exp = ordered(yrs_wrk_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    yrs_in_field_exp = ordered(yrs_in_field_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    edu = ordered(edu, 
                    c("High School", 
                      "Some college", 
                      "College degree", 
                      "Master's degree", 
                      "PhD", 
                      "Professional degree (MD, JD, etc.)")),
  ) -> sal_df

Limiting to US only

We will remove any rows not in the US. Part of the messiness of this data is the variety of ways a single country is written: United States, US, USA, etc. We keep it simple and eliminate rows where state is NA. Judging from city, roughly 171 of the rows where state is NA appear to be from the US, and while in some cases the state could be derived from the city, in others it is not clear. Since eliminating the rows without a state still leaves us with almost 23K rows, this seems the best solution.
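To illustrate that variety before any filtering, a quick tabulation of the raw country values could be run (a sketch; output not shown here):

#sketch: show the most common raw ctry spellings, illustrating the
#"United States" / "US" / "USA" variation described above
head(sort(table(trimws(sal_df$ctry)), decreasing = TRUE), 10)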

After removing all the non-US rows, we have 22,995 rows and 17 columns. We removed the ctry column, as it no longer adds value.

#create a new df sal_US with only rows where state != NA
sal_US <- sal_df[!is.na(sal_df$state),]
#remove the ctry column, all rows are now US
sal_US = subset(sal_US, select = -c(ctry) )

State

Taking a look at state, there are 106 cases where multiple states were selected. This is small compared to the total number of rows, so we eliminate them and still have almost 23K rows (22,889).

#unique state shows 132 different values, just showing tail for clarity. 
tail(unique(sal_US$state))
## [1] "Mississippi, Missouri"                                                        
## [2] "California, Maryland"                                                         
## [3] "California, Illinois, Massachusetts, North Carolina, South Carolina, Virginia"
## [4] "Alabama, California"                                                          
## [5] "Michigan, Texas, Washington"                                                  
## [6] "Alabama, Oregon"
#how many rows have a ',', eg multiple states
length(which(grepl(',', sal_US$state)))
## [1] 106
#delete the rows with multiple states
sal_US <- sal_US[!grepl(',', sal_US$state),]

Merge Data

Now that the state column is clean, we can merge the two data frames, bringing in the per capita income by state to each row.

#merge rows 
sal_US <- merge(sal_US, state_inc, by ='state', all.x=TRUE)

head(sal_US |> distinct(state,PIncPerCap))
##        state PIncPerCap
## 1    Alabama      50916
## 2     Alaska      68635
## 3    Arizona      58442
## 4   Arkansas      52618
## 5 California      77036
## 6   Colorado      75722

Target Column for high compensation

We want to create the target column indicating whether total compensation is above or below the state Personal Income Per Capita. The steps to do this are:
- sum the annual_sal and add_comp fields to make a total_comp field
- create a column (hi_comp) that compares total_comp with PIncPerCap, setting it to 1 if total_comp is greater and 0 otherwise.

#sum the annual_sal and add_comp
sal_US$total_comp <- rowSums(sal_US[, c('annual_sal','add_comp')], na.rm=TRUE)

#create the hi_comp field
sal_US <- sal_US |> mutate(hi_comp=if_else(total_comp>PIncPerCap,1,0))
sal_US$hi_comp <- as_factor(sal_US$hi_comp)

head(sal_US |> select (annual_sal, add_comp,total_comp,PIncPerCap, hi_comp))
##   annual_sal add_comp total_comp PIncPerCap hi_comp
## 1     120000       NA     120000      50916       1
## 2      70000       NA      70000      50916       1
## 3     120000    15000     135000      50916       1
## 4     175000       NA     175000      50916       1
## 5      40000        0      40000      50916       0
## 6      22800       NA      22800      50916       0

Looking at a bar graph of the distribution of the compensation classes, we can see that our survey over-represents the above-per-capita-income class. We will take this information into account when building models by balancing the dataset.

theColors <- c('#6aaec8','#52985e')

ggplot(sal_US, aes(x=hi_comp, fill=hi_comp)) +
  geom_bar (aes(y = (after_stat(count))/sum(after_stat(count)))) +
  ylab("Percent")+
  scale_fill_manual(values=theColors)+
    theme_classic()
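One way the balancing mentioned above could be done later is with caret's downSample(), which drops rows from the majority class until the classes are even. A sketch is shown below; the modeling stage may end up using a different balancing approach.

#sketch only: downsample the majority hi_comp class to match the minority class
#downSample() returns the predictors plus the target, renamed via yname
set.seed(123)
balanced <- downSample(x = subset(sal_US, select = -c(hi_comp)),
                       y = sal_US$hi_comp, yname = "hi_comp")
table(balanced$hi_comp)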

Data Skimming

Now we have the data we want to model. Let’s skim for a quick view of summary statistics, and a visualization of distributions with mini histograms.

After all our pre-processing, we have 22,889 responses (rows) with 20 variables.

We notice here that there are many missing values:

  • 16,905 rows are missing additional context to the title (title_context).
  • 22,851 missing values for other currency (other_ccy)
  • 20,408 missing values for additional context for income (income_context)
  • 5,720 missing values for additional compensation (add_comp)
  • a small number of rows are missing city (12), education level (edu, 138), gender (130), industry (56), and race (130).

We also see that the numeric data is mostly skewed right, except for PIncPerCap, which is rather normal.

#Data Skimming
skim(sal_US)
Data summary
Name sal_US
Number of rows 22889
Number of columns 20
_______________________
Column type frequency:
character 10
factor 6
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 20 0 51 0
timestamp 0 1.00 16 19 0 20899 0
industry 56 1.00 2 171 0 1022 0
job_title 0 1.00 1 126 0 11916 0
title_context 16905 0.26 1 808 0 5771 10
ccy 0 1.00 3 7 0 8 0
other_ccy 22851 0.00 1 100 0 24 0
income_context 20408 0.11 1 939 0 2435 2
city 12 1.00 1 84 0 3710 6
race 130 0.99 5 125 0 45 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age 0 1.00 TRUE 7 25-: 10180, 35-: 8148, 45-: 2646, 18-: 960
yrs_wrk_exp 0 1.00 TRUE 8 11 : 7978, 8 -: 4371, 5-7: 3924, 21 : 3001
yrs_in_field_exp 0 1.00 TRUE 8 11 : 5401, 5-7: 5363, 2 -: 4944, 8 -: 4130
edu 138 0.99 TRUE 6 Col: 11148, Mas: 7363, Som: 1612, Pro: 1119
gender 130 0.99 FALSE 5 Wom: 17755, Man: 4187, Non: 590, Oth: 226
hi_comp 0 1.00 FALSE 2 1: 14389, 0: 8500

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
annual_sal 0 1.00 92025.95 75165.15 0 56160 78000 112000 5000044 ▇▁▁▁▁
add_comp 5720 0.75 12700.58 46465.79 0 0 2000 10000 1500000 ▇▁▁▁▁
PIncPerCap 0 1.00 69626.18 10301.27 46370 61475 68840 77036 95970 ▁▇▇▅▁
total_comp 0 1.00 101552.64 93505.36 0 58000 81000 120000 5004044 ▇▁▁▁▁

Missing Values

Let’s look at how we might address the missing values.

industry (56)

Industry looks like it was a free text field, and there are over 1,000 different values. Not super helpful. A bit of research reveals a grouping of industry sectors as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)
We will attempt to group all the industries into one of these groups in a new column, indust_grp.

A sample of the industry values:

head(unique(sal_US$industry))
## [1] "Computing or Tech"                       
## [2] "Federal Contracting/Business Development"
## [3] "Government and Public Administration"    
## [4] "Law"                                     
## [5] "STEM medical research"                   
## [6] "Education (Higher Education)"

Showing the first few job_title values where industry is NA:

head(sal_US$job_title[is.na(sal_US$industry)])
## [1] "Adult Services Librarian"             
## [2] "Cybersecurity senior risk specialist "
## [3] "Senior Manager of Ops Strategy"       
## [4] "UX Designer"                          
## [5] "Lab Assistant (Level II)"             
## [6] "People Operations Manager "

title_context (16,905)

This field is meant to provide further explanation when the job_title requires it. Again, this is a free text field, and though most values are NA, there are still almost 6K responses, most of them likely unique.

Our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the job title (1) or not (0). One thing to consider is that some respondents seem to have written in “na” or something similar; we will want those to become 0 rather than 1 (where it is possible to identify this case).

head(unique(sal_US$title_context))
## [1] NA                                                                  
## [2] "Senior associate"                                                  
## [3] "Mid- level professional biotechnology laboratory researcher "      
## [4] "na"                                                                
## [5] "Make graphics and videos, write captions post them on social media"
## [6] "Commercial design (not residential)"

other_ccy (22,851)

As we eliminated the non-US rows, we would expect ccy to be all USD, and therefore other_ccy to be all NAs. Instead, other_ccy is populated for a handful of rows. Looking at the values in ccy, we find the assumption that everything would be USD is wrong: 26 rows have another ccy (13 of them listed as “Other”).

table(sal_US$ccy)
## 
## AUD/NZD     CAD     EUR     GBP     JPY   Other     SEK     USD 
##       1       5       4       1       1      13       1   22863

Looking at what is in other_ccy, we find a few entries that are USD or variations (US Dollar, American Dollars), a few that are another currency or equity/stock, but most seem to be answering a different question.

Since the comparison to the state personal income per capita requires that total_comp be in USD, we will ‘correct’ the 3 types of ‘Other’ entries that are clearly USD, and then eliminate the remaining rows that are not USD (or not identifiable as such).

tail (unique(sal_US$other_ccy))
## [1] "Na "                                                 
## [2] "My bonus is based on performance up to 10% of salary"
## [3] "Equity"                                              
## [4] "N/a"                                                 
## [5] "Stock "                                              
## [6] "American Dollars"

income_context (20,408)

Similar to title_context, the purpose of this variable is to provide further explanation when the annual_sal, add_comp or ccy fields require it. Again, this is a free text field, and though most values are NA, there are still almost 2.5K responses, most of them likely unique.

As with title_context, our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the income (1) or not (0). Here again, some respondents seem to have written in “na” or something similar; we will want those to become 0 rather than 1 (where it is possible to identify this case).

head(unique(sal_US$income_context))
## [1] NA                                                                                                                         
## [2] "Need to Pay Fees"                                                                                                         
## [3] "na"                                                                                                                       
## [4] "24.2/hr"                                                                                                                  
## [5] "Clinical pharmacist in a small hospital"                                                                                  
## [6] "My bonuses are based on my fees in cases and it’s never a consistent amount. Ranges from $10,000-$50,000 per year so far."

city (12)

We notice that in addition to the 12 actual NAs, there are 32 rows with “na”s and another 92 with some variation of “prefer not to answer”.

We are not sure how city might affect the model. Some cities might have higher salaries, but, again, this is free text with over 2K unique responses.

Since there are so many unique responses and we cannot think of a good way to group them, we will eliminate this column.

head(unique(sal_US$city))
## [1] "Birmingham"  "Huntsville"  "Birmingham " "Auburn"      "Athens"     
## [6] "na"

Demographic variables: race, edu, gender

There are very similar counts of NAs for race, edu and gender (130, 138 and 130, respectively).

There are about 20 responses where all three of these are blank. It makes sense to get rid of those rows.

For the rest:

  • gender has both “Other or prefer not to answer” and “Prefer not to answer” as well as NAs. We will combine those into a single “Prefer not to answer”.
  • similarly, we will replace the NAs in edu with “Prefer not to answer”.
  • race will take a bit more work, as it is a check-all-that-apply field. In addition to replacing the NAs with “Prefer not to answer”, we also have to split the field into flags for each of the 7 options provided (one of which is “Another option not listed here or prefer not to answer”, so the NAs will go there).

Note that there are 6 options shown below, but there is also a 7th: ‘Hispanic, Latino, or Spanish origin’, which only appears along with other options selected.

#### Demographic variables
unique(sal_US[(!grepl(',', sal_US$race)), c('race')])
## [1] "White"                                                 
## [2] "Asian or Asian American"                               
## [3] "Another option not listed here or prefer not to answer"
## [4] "Native American or Alaska Native"                      
## [5] "Black or African American"                             
## [6] "Middle Eastern or Northern African"                    
## [7] NA

We now have a plan for what data preparation is needed for missing values. Let’s explore other aspects of the data.

Variable Distribution

The histograms in skim are small and some detail is lost, so let’s take a closer look at some of the continuous variables.

In our first look, the visualization of the distributions is dominated by the very long tails.

#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
pa <- sal_US |> 
    ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
pb <- sal_US |> 
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
pc <- sal_US |> 
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number_si()) +
    theme_classic()
pd <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number_si())+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(pa,pb,pc,pd)

The 75th percentile of annual_sal is less than $150,000, while the max is over $5,000,000. There is a very small number of rows (fewer than 20) where either annual_sal or add_comp is greater than $1 million. In order to see what the majority (roughly the first 75%) of the data looks like, we will filter annual_sal, add_comp and total_comp to values below $150,000.

When we do this, we can see that the large majority of the data is fairly normal; the right skew in annual_sal, add_comp and total_comp comes from the small number of very high values.

PIncPerCap looks multimodal.
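The “fewer than 20” extreme rows mentioned above can be confirmed with a quick count (a sketch; output not shown):

#sketch: count rows where either pay field exceeds $1,000,000
sum(sal_US$annual_sal > 1000000 | sal_US$add_comp > 1000000, na.rm = TRUE)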

#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
p1 <- sal_US |> 
    filter(annual_sal < 150000) |>
  ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
p2 <- sal_US |> 
    filter(add_comp < 150000) |>
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
p3 <- sal_US |> 
    filter(total_comp < 150000) |>
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number_si()) +
    theme_classic()
p4 <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number_si())+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(p1,p2,p3,p4)

Target Flag Breakdown

We want to take a look at what contributes to having a total_comp greater than PIncPerCap. Let’s look at the distribution of who does and who does not exceed it, broken down by the qualitative variables that we have.

This shows that the percent of respondents with hi_comp=1 (total_comp is greater than PIncPerCap):

  • generally increases with age
  • increases with both yrs_wrk_exp and yrs_in_field_exp until about 30 years then decreases.
  • increases with education levels
  • gender is a bit all over the place, but needs some cleanup to combine NA and the two variations of “Prefer not to answer”.

theColors <- c('#6aaec8','#52985e')

p5<-ggplot(sal_US, aes(fill=hi_comp, x=age)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Age")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p6<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_wrk_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years Work Exp")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p7<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_in_field_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years In Field")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p8<-ggplot(sal_US, aes(fill=hi_comp, x=edu)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Education")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p9<-ggplot(sal_US, aes(fill=hi_comp, x=gender)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Gender")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank(), legend.title=element_text(size=8))

ggarrange(p5,p6,p7,p8,p9, nrow=3, ncol = 2)

Target Amount Breakdown

Again here, we look at rows where total_comp < 150,000 (roughly the lowest three quartiles).

# Target Amount Breakdown
# density plots
multdens <- function(df, var, target) {
  
  df |> 
    dplyr::select(annual_sal, add_comp, total_comp, hi_comp) |> 
    filter(!is.na(.data[[var]])) |> filter(total_comp < 150000) |>
    ggplot(aes(.data[[var]], fill = .data[[target]], color = .data[[target]])) +
      geom_density(alpha = 0.4) +
    scale_fill_manual(values=theColors)+
    scale_color_manual(values=theColors)+
      labs(y = NULL) +
      theme_minimal(base_size = 6) +
      theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}

num_col <- sal_US |>
  dplyr::select(annual_sal, add_comp, total_comp) |>
  colnames()

dens <- lapply(num_col, multdens, df = sal_US, target = "hi_comp")

plot_grid(plotlist = dens, label_x = NULL, ncol = 1)

Maps

To understand how the Personal Income Per Capita varies by state, we can plot it on a map.

Because DC is so small on the map, it is hard to see that it has the highest per capita personal income, $95,970. But you can see that it is followed by Massachusetts (MA) at $84,561 and Connecticut (CT) at $82,938. Of course, the per capita calculation will be greatly impacted by population, especially in the highest population states (California, Texas, Florida and New York) and the lowest population states (Wyoming, Vermont, DC, and Alaska).

# map of per capita personal income 
map<-plot_usmap(data= state_inc, values ='PIncPerCap', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Personal Income Per Capita', label = scales::comma) + 
  theme(legend.position = "right")

# Set label font size
map$layers[[2]]$aes_params$size <- 1.5
print(map)

We are also interested to see how many responses there are in each state. To do this, we create a small table with just the state and the count of rows with that state in the sal_US data.

California (CA), with 2,607 responses, and New York (NY), with 2,168, are the top two. Wyoming (WY), with 24, and South Dakota (SD), with 25, are at the bottom. Every state (plus DC) is represented.

#first we want to gather the data to map
occ<-as.data.frame(table(sal_US$state))
names(occ)<-c('state','responses')
#map of responses by state
map_2<-plot_usmap(data= occ, values ='responses', color = '#52985e', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#03688f',name = 'Responses', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_2$layers[[2]]$aes_params$size <- 1.5
print(map_2)

For our third map, we are interested in the mean of the total_comp reported by state. As we did above, we will create a small table to capture the data, then map it.

The top 3 are California (CA) at $135,362.24, Washington (WA) at $127,442.40 and New York (NY) at $117,769.82. At the other end of the spectrum are Wyoming (WY) at $60,475.96, Montana (MT) at $63,200.66 and Mississippi (MS) at $63,929.22.

#first we want to gather the data to map
avg<-sal_US |>
  group_by(state) |>
  summarise_at(vars(total_comp), list(name = mean))
names(avg)<-c('state','avg_comp')
#map of mean total comp by state
map_3<-plot_usmap(data= avg, values ='avg_comp', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Mean Total Comp', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_3$layers[[2]]$aes_params$size <- 1.5
print(map_3)

Data Preparation

Now that we have taken a good look at our data, we can begin to prepare it for modeling, both to predict when total_comp is greater than personal income per capita and to predict the actual total_comp.

In order to be able to compare against our pre-processed baseline data, we will create a new data frame, sal_US1.

# Data Preparation
# new data frame for data prep 
sal_US1 <- sal_US

Missing Values

Some missing values will actually be addressed by creating new categories, so those will be discussed in the following section.

ccy

To make valid comparisons, we want all responses to be reported in USD. As described above, there are a few rows with ccy = ‘Other’ where the value in other_ccy makes it clear the amount is in USD. We will update those and then delete the 23 non-USD rows. Once that is complete, we no longer need ccy or other_ccy, so we will remove those columns.

#correct the others that are really USD
sal_US1 <- sal_US1 |>
  mutate (ccy = ifelse(ccy=='Other' & (other_ccy=='USD' | other_ccy == "US Dollar" | other_ccy == "American Dollars"),'USD', ccy))

#Keep only the rows where ccy=USD
sal_US1 <- sal_US1[sal_US1$ccy =='USD',]

table(sal_US1$ccy)
#remove ccy and other_ccy columns as now all USD
sal_US1 = subset(sal_US1, select = -c(ccy, other_ccy))

city

Recall that city is free text, with thousands of unique values and no good way to categorize it. We do not need to deal with the NAs, as we will remove this column.

#remove city column
sal_US1 = subset(sal_US1, select = -c(city))

New Categorical Columns

Industry Group

As described above, we will group industry into 5 categories as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)

This will take some iterative effort to identify the group for the many varied values of the free text industry field.

#make indust_grp field
sal_US1 <- sal_US1 %>% mutate (indust_grp = 
            case_when (
#Raw Materials
              grepl('Forest', industry) ~ 'Raw Materials',
              grepl('Mining', industry) ~ 'Raw Materials',
              grepl('Oil', industry) ~ 'Raw Materials',
              grepl('oil', industry) ~ 'Raw Materials',
              grepl('Energy', industry) ~ 'Raw Materials',
              grepl('energy', industry) ~ 'Raw Materials',
              grepl('Petroleum', industry) ~ 'Raw Materials',
              grepl('Agriculture', industry) ~ 'Raw Materials',
              grepl('Fisher', industry) ~ 'Raw Materials',
              
#Manufacturing and Construction             
              grepl('Construction', industry) ~ 'Manufacturing and Construction',
              grepl('Contracting', industry) ~ 'Manufacturing and Construction',
              grepl('Manufac', industry) ~ 'Manufacturing and Construction',
              grepl('MANUFAC', industry) ~ 'Manufacturing and Construction',
              grepl('manufac', industry) ~ 'Manufacturing and Construction',
              grepl('Mfg', industry) ~ 'Manufacturing and Construction',
              grepl('coat', industry) ~ 'Manufacturing and Construction',
              grepl('Wine', industry) ~ 'Manufacturing and Construction',              
              grepl('Packaged Goods', industry) ~ 'Manufacturing and Construction',
            grepl('packaged goods', industry) ~ 'Manufacturing and Construction',
            grepl('Goods', industry) ~ 'Manufacturing and Construction',
            grepl('Consumer', industry) ~ 'Manufacturing and Construction',
            grepl('consumer', industry) ~ 'Manufacturing and Construction',
            grepl('Commercial furniture', industry) ~ 'Manufacturing and Construction',
            grepl('Industrial', industry) ~ 'Manufacturing and Construction',
            grepl('Craft Beer Industry', industry) ~ 'Manufacturing and Construction',
            grepl('Beverage', industry) ~ 'Manufacturing and Construction',
            grepl('Brew', industry) ~ 'Manufacturing and Construction',
            grepl('Chemicals', industry) ~ 'Manufacturing and Construction',
            grepl('Automotive', industry) ~ 'Manufacturing and Construction',
#Information Services             
              grepl('Educ', industry) ~ 'Information Services',
              grepl('educ', industry) ~ 'Information Services',
              grepl('acad', industry) ~ 'Information Services',
              grepl('learn', industry) ~ 'Information Services',
              grepl('School', industry) ~ 'Information Services',
              grepl('Acad', industry) ~ 'Information Services',
              grepl('College', industry) ~ 'Information Services',
              grepl('Tech', industry) ~ 'Information Services',
              grepl('tech', industry) ~ 'Information Services',
              grepl('Anal', industry) ~ 'Information Services',
              grepl('development', industry) ~ 'Information Services',

              grepl('Title IX', industry) ~ 'Information Services',
              grepl('Publication', industry) ~ 'Information Services',              
              grepl('edit', industry) ~ 'Information Services',
              grepl('Publish', industry) ~ 'Information Services',
              grepl('publish', industry) ~ 'Information Services',
              grepl('Ecology', industry) ~ 'Information Services',
              grepl('R&D', industry) ~ 'Information Services',
              grepl('Info', industry) ~ 'Information Services',
              grepl('Librar', industry) ~ 'Information Services',
              grepl('LIbrar', industry) ~ 'Information Services',
              grepl('librar', industry) ~ 'Information Services',
              grepl('LIBRAR', industry) ~ 'Information Services',

              grepl('Archivist', industry) ~ 'Information Services',
              grepl('Reporter', industry) ~ 'Information Services',
              grepl('Engineer', industry) ~ 'Information Services',
              grepl('engineer', industry) ~ 'Information Services',
              grepl('Copywriter', industry) ~ 'Information Services',
              grepl('Develop', industry) ~ 'Information Services',
              grepl('Edit', industry) ~ 'Information Services',
              grepl('Design', industry) ~ 'Information Services',
              grepl('Teach', industry) ~ 'Information Services',
              grepl('Train', industry) ~ 'Information Services',
              grepl('Instruct', industry) ~ 'Information Services',
              grepl('Research', industry) ~ 'Information Services',
              grepl('research', industry) ~ 'Information Services',
              grepl('Media', industry) ~ 'Information Services',
              grepl('STEM', industry) ~ 'Information Services',
              grepl('Think tank', industry) ~ 'Information Services',
              grepl('Marketing', industry) ~ 'Information Services',
              grepl('marketing', industry) ~ 'Information Services',
              grepl('Communica', industry) ~ 'Information Services',
              grepl('market research', industry) ~ 'Information Services',
              grepl('Data', industry) ~ 'Information Services',
              grepl('data', industry) ~ 'Information Services',
              grepl('content', industry) ~ 'Information Services',
              grepl('Content', industry) ~ 'Information Services',
              grepl('CPG', industry) ~ 'Information Services',
              grepl('IT', industry) ~ 'Information Services',
              grepl('Scien', industry) ~ 'Information Services',
              grepl('science', industry) ~ 'Information Services',
              grepl('Video', industry) ~ 'Information Services',
              grepl('Internet', industry) ~ 'Information Services',
            grepl('Archaeology', industry) ~ 'Information Services',
              grepl('Bio', industry) ~ 'Information Services',
              grepl('chemistry', industry) ~ 'Information Services',
              grepl('student', industry) ~ 'Information Services',
              grepl('Student', industry) ~ 'Information Services',
              grepl('Graduate assistant', industry) ~ 'Information Services',
              grepl('It security', industry) ~ 'Information Services',
              grepl('lab', industry) ~ 'Information Services',
              grepl('Print', industry) ~ 'Information Services',
            grepl('Chemistry', industry) ~ 'Information Services',
            grepl('Gaming', industry) ~ 'Information Services',
            grepl('Software', industry) ~ 'Information Services',
            grepl('Journalism', industry) ~ 'Information Services',
            grepl('journalism', industry) ~ 'Information Services',
            grepl('Work-Study', industry) ~ 'Information Services',
            grepl('Geospatial', industry) ~ 'Information Services',
            grepl('Virtual reality', industry) ~ 'Information Services',
            grepl('Survey', industry) ~ 'Information Services',
            
            grepl('PR', industry) ~ 'Information Services',
            grepl('Quality Assurance', industry) ~ 'Information Services',
#Human Services              
              grepl('Gov', industry) ~ 'Human Services',
              grepl('gov', industry) ~ 'Human Services',
              grepl('Volun', industry) ~ 'Human Services',
              grepl('Policy', industry) ~ 'Human Services',
              grepl('City', industry) ~ 'Human Services',
              grepl('Federal', industry) ~ 'Human Services',
              grepl('federal', industry) ~ 'Human Services',
              grepl('State', industry) ~ 'Human Services',
              grepl('Grant', industry) ~ 'Human Services',

              grepl('Nonprofit', industry) ~ 'Human Services',
              grepl('Non-profit', industry) ~ 'Human Services',
              grepl('Workforce', industry) ~ 'Human Services',
              grepl('Social Work', industry) ~ 'Human Services',
              grepl('Fundrais', industry) ~ 'Human Services',
              grepl('Environ', industry) ~ 'Human Services',
              grepl('environ', industry) ~ 'Human Services',
              grepl('Natural', industry) ~ 'Human Services',
              grepl('not for profit', industry) ~ 'Human Services',
              grepl('Non Profit', industry) ~ 'Human Services',
              grepl('Non profit', industry) ~ 'Human Services',
              grepl('Military', industry) ~ 'Human Services',
              grepl('Ministry', industry) ~ 'Human Services',
              grepl('Clergy', industry) ~ 'Human Services',
              grepl('Chaplain', industry) ~ 'Human Services',
              grepl('Foundation', industry) ~ 'Human Services',
              grepl('Religio', industry) ~ 'Human Services',
              grepl('Heritage', industry) ~ 'Human Services',
              grepl('AmeriCorps', industry) ~ 'Human Services',
              grepl('Intelligence', industry) ~ 'Human Services',
              grepl('Museum', industry) ~ 'Human Services',
              grepl('Zoo', industry) ~ 'Human Services',
              grepl('Philanthropy', industry) ~ 'Human Services',
              grepl('International development', industry) ~ 'Human Services',
              grepl('Historic Preservation', industry) ~ 'Human Services',
              grepl('Maritime', industry) ~ 'Human Services',
              grepl('Church', industry) ~ 'Human Services',
              grepl('Association', industry) ~ 'Human Services',
              grepl('association', industry) ~ 'Human Services',
              grepl('Politic', industry) ~ 'Human Services',
              grepl('politic', industry) ~ 'Human Services',
              grepl('Defense', industry) ~ 'Human Services',
              grepl('union', industry) ~ 'Human Services',
              grepl('Union', industry) ~ 'Human Services',
              grepl('Labor', industry) ~ 'Human Services',

              grepl('Funding', industry) ~ 'Human Services',
              grepl('Performing Arts', industry) ~ 'Human Services',
              grepl('Hosp', industry) ~ 'Human Services',
              grepl('parent', industry) ~ 'Human Services',

#Service              
              grepl('Law', industry) ~ 'Service',
              grepl('Legal', industry) ~ 'Service',
              grepl('attorney', industry) ~ 'Service',
              grepl('Psychol', industry) ~ 'Service',
              grepl('Counseling', industry) ~ 'Service',
              grepl('Doc', industry) ~ 'Service',
              grepl('Forensics', industry) ~ 'Service',
              grepl('Toxicology', industry) ~ 'Service',
              grepl('Medical', industry) ~ 'Service',
              grepl('Manage', industry) ~ 'Service',
              grepl('manage', industry) ~ 'Service',
              grepl('Admin', industry) ~ 'Service',
              grepl('Reception', industry) ~ 'Service',
              grepl('design', industry) ~ 'Service',
              grepl('Sale', industry) ~ 'Service',
              grepl('sale', industry) ~ 'Service',
              grepl('Buyer', industry) ~ 'Service',
              grepl('Bank', industry) ~ 'Service',
              grepl('finance', industry) ~ 'Service',
              grepl('trading', industry) ~ 'Service',
              grepl('Archi', industry) ~ 'Service',
              grepl('archi', industry) ~ 'Service',
              grepl('product development', industry) ~ 'Service',
              grepl('Equipment', industry) ~ 'Service',
              grepl('Health', industry) ~ 'Service',
              grepl('health', industry) ~ 'Service',
              grepl('Retail', industry) ~ 'Service',
              grepl('Tour', industry) ~ 'Service',
              grepl('Screening', industry) ~ 'Service',
              grepl('Insur', industry) ~ 'Service',
              grepl('Hospitality', industry) ~ 'Service',
              grepl('Consult', industry) ~ 'Service',
              grepl('consult', industry) ~ 'Service',
              grepl('Rideshare', industry) ~ 'Service',
              grepl('Transp', industry) ~ 'Service',
              grepl('Logist', industry) ~ 'Service',
              grepl('Distribution', industry) ~ 'Service',
              grepl('Deliver', industry) ~ 'Service',
              grepl('deliver', industry) ~ 'Service',
              grepl('sustainability', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('Planning', industry) ~ 'Service',      
              grepl('Recruit', industry) ~ 'Service',
              grepl('HR', industry) ~ 'Service',
              grepl('Equity', industry) ~ 'Service',
              grepl('Human', industry) ~ 'Service',
              grepl('Childcare', industry) ~ 'Service',
              grepl('Child care', industry) ~ 'Service',
              grepl('Child Care', industry) ~ 'Service',
              grepl('Telecom', industry) ~ 'Service',
              grepl('parking', industry) ~ 'Service',
              grepl('restoration', industry) ~ 'Service',
              grepl('Homemaker', industry) ~ 'Service',
              grepl('Caregiver', industry) ~ 'Service',

              grepl('Landscap', industry) ~ 'Service',
              grepl('landscap', industry) ~ 'Service',
              grepl('Repair', industry) ~ 'Service',
              grepl('repair', industry) ~ 'Service',
              grepl('Facilities', industry) ~ 'Service',

              grepl('engraving', industry) ~ 'Service',
              grepl('Call center', industry) ~ 'Service',
              grepl('Wholesale', industry) ~ 'Service',
              grepl('wholesale', industry) ~ 'Service',
              grepl('CALL', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('pharma', industry) ~ 'Service',
              grepl('Regulator', industry) ~ 'Service',
              grepl('Entertain', industry) ~ 'Service',
              grepl('Sport', industry) ~ 'Service',
              grepl('Music', industry) ~ 'Service',
              grepl('music', industry) ~ 'Service',
              grepl('I work for Indeed.com', industry) ~ 'Service',
              grepl('Interpret', industry) ~ 'Service',
              grepl('Funeral', industry) ~ 'Service',
              grepl('funeral', industry) ~ 'Service',
              grepl('Translation', industry) ~ 'Service',
              grepl('Concrete', industry) ~ 'Service',
              grepl('ecomm', industry) ~ 'Service',
              grepl('E-comm', industry) ~ 'Service',
              grepl('E-Comm', industry) ~ 'Service',
              grepl('e-comm', industry) ~ 'Service',
              grepl('E comm', industry) ~ 'Service',
              grepl('EComm', industry) ~ 'Service',
              grepl('Ecomm', industry) ~ 'Service',
              grepl('Pest', industry) ~ 'Service',
              grepl('Service', industry) ~ 'Service',
              grepl('service', industry) ~ 'Service',
              grepl('Cleaning', industry) ~ 'Service',
              grepl('Saas', industry) ~ 'Service',
              grepl('compli', industry) ~ 'Service',
              grepl('Assisting', industry) ~ 'Service',

              grepl('Apparel', industry) ~ 'Service',
              grepl('appraisal', industry) ~ 'Service',
              grepl('Compli', industry) ~ 'Service',
              grepl('Business', industry) ~ 'Service',
              grepl('Strategy', industry) ~ 'Service',
              grepl('Eap', industry) ~ 'Service',
              grepl('Investing', industry) ~ 'Service',
              grepl('Storage', industry) ~ 'Service',
              grepl('Supply', industry) ~ 'Service',
              grepl('Procur', industry) ~ 'Service',
              grepl('Airline', industry) ~ 'Service',
              grepl('Aerospace', industry) ~ 'Service',
              grepl('aviat', industry) ~ 'Service',
              grepl('Aviat', industry) ~ 'Service',
              grepl('Importing', industry) ~ 'Service',
              grepl('Staffing', industry) ~ 'Service',
              grepl('Cannabis', industry) ~ 'Service',
              grepl('warehous', industry) ~ 'Service',
              grepl('Warehous', industry) ~ 'Service',
              grepl('Rental', industry) ~ 'Service',
              grepl('Food', industry) ~ 'Service',
              grepl('food', industry) ~ 'Service',
              grepl('cook', industry) ~ 'Service',
              grepl('Beauty', industry) ~ 'Service',
              grepl('safety', industry) ~ 'Service',
              grepl('protect', industry) ~ 'Service',
  
              grepl('Operations', industry) ~ 'Service',
              grepl('Real Estate', industry) ~ 'Service',
              grepl('Real estate', industry) ~ 'Service',
              grepl('real estate', industry) ~ 'Service',
              grepl('Fitness', industry) ~ 'Service',
              grepl('Gym', industry) ~ 'Service',
              grepl('accessibility', industry) ~ 'Service',
              grepl('Accessibility', industry) ~ 'Service',
              grepl('Executive', industry) ~ 'Service',
              grepl('fitness', industry) ~ 'Service',
              grepl('Security', industry) ~ 'Service',
              grepl('Per Sit', industry) ~ 'Service',
              grepl('Pet', industry) ~ 'Service',
              grepl('Vet', industry) ~ 'Service',
              grepl('Animal', industry) ~ 'Service',
              grepl('restaurant', industry) ~ 'Service',
              grepl('Restaurant', industry) ~ 'Service',
              grepl('Auction', industry) ~ 'Service',
              grepl('Mortgage', industry) ~ 'Service',
              grepl('mortgage', industry) ~ 'Service',
              grepl('Tax', industry) ~ 'Service',
              grepl('Film', industry) ~ 'Service',

              grepl('Travel', industry) ~ 'Service',
              grepl('Gambl', industry) ~ 'Service',
              grepl('office', industry) ~ 'Service',
              grepl('Office', industry) ~ 'Service',
              grepl('recycling', industry) ~ 'Service',
              grepl('Fashion', industry) ~ 'Service',
              grepl('Finance', industry) ~ 'Service',
              grepl('coach', industry) ~ 'Service',
              grepl('Corporate', industry) ~ 'Service',
              grepl('camp', industry) ~ 'Service',            
              grepl('Cosmetology', industry) ~ 'Service',
              grepl('Purchasing', industry) ~ 'Service',
              grepl('Printing', industry) ~ 'Service',
              grepl('Global Mobility', industry) ~ 'Service',
              grepl('professional', industry) ~ 'Service',
              grepl('Horticulture', industry) ~ 'Service',
              grepl('Maintenance', industry) ~ 'Service',
              grepl('Janitorial', industry) ~ 'Service'
    ))
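As an aside, the long chain of grepl() calls could be condensed by combining related keywords into case-insensitive patterns. Below is a partial sketch covering only the first two groups; it is not applied to the data and is not an exact reproduction of the rules above.

#partial sketch: one case-insensitive pattern per group; unmatched rows fall to NA
#order still matters, exactly as in the full case_when() above
sal_chk <- sal_US1 %>% mutate(indust_grp_alt =
            case_when(
              grepl('forest|mining|oil|energy|petroleum|agricult|fisher',
                    industry, ignore.case = TRUE) ~ 'Raw Materials',
              grepl('construction|contracting|manufac|mfg|beverage|brew|chemical|automotive',
                    industry, ignore.case = TRUE) ~ 'Manufacturing and Construction',
              TRUE ~ NA_character_
            ))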

Once complete, we can see that Information Services has the largest number of responses, and very few were from the Raw Materials Industry Group.

table(sal_US1$indust_grp)
## 
##                 Human Services           Information Services 
##                           4166                           9083 
## Manufacturing and Construction                  Raw Materials 
##                           1876                            169 
##                        Service 
##                           7523

There are 58 rows left with NA in industry and therefore in indust_grp. We will remove those rows and then remove industry, leaving just the indust_grp column.

We now have 22,817 rows and 17 variables.

sal_US1 <- sal_US1[!is.na(sal_US1$indust_grp),]
sal_US1 = subset(sal_US1, select = -c(industry))

Context fields

We will turn both context fields, title_context and income_context, into flags (0,1) indicating whether there is additional context. The values in these fields are too varied to be useful in any other categorization, but a response where it was necessary to add context might still be associated with total_comp one way or another. We will then remove the original fields, leaving the same number of variables, 17.

#create the title_context_flg field
sal_US1 <- sal_US1 |>
  mutate(title_context_flg=if_else(!is.na(title_context),1,0))
sal_US1$title_context_flg <- as_factor(sal_US1$title_context_flg)
#create the income_context_flg field
sal_US1 <- sal_US1 |>
  mutate(income_context_flg=if_else(!is.na(income_context),1,0))
sal_US1$income_context_flg <- as_factor(sal_US1$income_context_flg)

sal_US1 = subset(sal_US1, select = -c(income_context, title_context))

As we can see, we have 5,974 rows with title_context_flg=1

table(sal_US1$title_context_flg)
## 
##     0     1 
## 16843  5974

And 2,475 rows with income_context_flg=1

table(sal_US1$income_context_flg)
## 
##     0     1 
## 20342  2475
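As noted during exploration, a few respondents typed “na”, “n/a”, or something similar into these free-text fields, and ideally those would get a flag of 0. A small helper like the sketch below could identify such write-ins; it is not applied here and would have to run before the original context columns are dropped above.

#sketch: detect obvious "no context" write-ins such as "na" or "N/A"
#e.g. title_context_flg = if_else(!is.na(title_context) & !is_noise(title_context), 1, 0)
is_noise <- function(x) grepl('^\\s*(na|n/a|none|-)\\s*$', x, ignore.case = TRUE)
is_noise(c("na", "N/A ", "Senior associate"))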

Demographic Columns

There are a few rows where race, edu and gender are all NA. These will be removed, leaving 22,799 rows.

sal_US1<-sal_US1[rowSums(is.na(sal_US1[, c("race", "edu", "gender")])) < 3,]

Next, we want to combine “Other or prefer not..” + “Prefer not..” + NAs in gender, leaving no NAs and a more organized set of values.

sal_US1 <- sal_US1 |>
  mutate(gender=if_else(grepl('refer', gender) | is.na(gender),"Prefer not to answer",gender))

unique(sal_US1$gender)
## [1] "Man"                  "Woman"                "Non-binary"          
## [4] "Prefer not to answer"

Similarly, we want to combine “Prefer not..” + NAs in edu, leaving no NAs and a more organized set of values.

sal_US1 <- sal_US1 |>
  mutate(edu=if_else(is.na(edu),"Prefer not to answer",edu))

unique(sal_US1$edu)
## [1] "College degree"                     "Master's degree"                   
## [3] "Professional degree (MD, JD, etc.)" "High School"                       
## [5] "Some college"                       "PhD"                               
## [7] "Prefer not to answer"

For race, we combine “Another option not listed here or prefer not to answer” + NAs. Since this is a check-all-that-apply field, we then must make a set of flags to indicate whether each of the 7 choices was selected. Once those fields are created, we remove the original multiselect race column.

sal_US1 <- sal_US1 |>
  mutate(race=if_else(is.na(race),"Another or prefer not to answer",race))

sal_US1 <- sal_US1 |>
  mutate(White=if_else(grepl('White', race),1,0)) |>
  mutate(AsianOrAsianAm=if_else(grepl('Asian', race),1,0)) |>
  mutate(NatAmOrAlaskaNat=if_else(grepl('Native', race),1,0)) |>
  mutate(BlackOrAficanAm=if_else(grepl('Black', race),1,0)) |>
  mutate(MiddleEorNAfrican=if_else(grepl('Middle', race),1,0))|>
  mutate(PreferNot=if_else(grepl('prefer', race),1,0)) |>
  mutate(HispLatSp=if_else(grepl('Hisp', race),1,0))

sal_US1 = subset(sal_US1, select = -c(race))

Since we were not able to look at these fields in the Target Flag Breakdown section of the Data Exploration, we will do so here.

#race flag breakdown by hi_comp
p10<-ggplot(filter(sal_US1, White==1), 
            aes(fill=hi_comp, x=White)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank()) 

p11<-ggplot(filter(sal_US1, AsianOrAsianAm==1), 
            aes(fill=hi_comp, x=AsianOrAsianAm)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank()) 

p12<-ggplot(filter(sal_US1, NatAmOrAlaskaNat==1), 
            aes(fill=hi_comp, x=NatAmOrAlaskaNat)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p13<-ggplot(filter(sal_US1, BlackOrAficanAm==1), 
            aes(fill=hi_comp, x=BlackOrAficanAm)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p14<-ggplot(filter(sal_US1, MiddleEorNAfrican==1), 
            aes(fill=hi_comp, x=MiddleEorNAfrican)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p15<-ggplot(filter(sal_US1, HispLatSp==1), 
            aes(fill=hi_comp, x=HispLatSp)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p16<-ggplot(filter(sal_US1, PreferNot==1), 
            aes(fill=hi_comp, x=PreferNot)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          axis.text.y=element_blank())
ggarrange(p10,p11,p12,p13,p14,p15,p16, nrow=3, ncol = 3)

add_comp

Much of add_comp is NA, as not everyone gets compensation beyond salary. We have captured the added amount in total_comp (annual_sal+add_comp). But it may be of interest whether add_comp is received at all. So, here again, we create a flag to indicate if add_comp was populated (1) or not (0). Then we remove the original field.

#create the add_comp_flg field
sal_US1 <- sal_US1 |>
  mutate(add_comp_flg=if_else(!is.na(add_comp),1,0))
sal_US1$add_comp_flg <- as_factor(sal_US1$add_comp_flg)
#remove add_comp
sal_US1 = subset(sal_US1, select = -c(add_comp))

As we can see, we have 17,110 rows with add_comp_flg=1

table(sal_US1$add_comp_flg)
## 
##     0     1 
##  5689 17110

Job Title

This is likely one of the most powerful variables we have for predicting our response, as it is widely known that the type of job you have is directly related to how much income you earn. The problem with using this field in our models is that right now it is only a character column with a multitude of different responses that cannot reasonably be converted to factors as is.

Thus, we will attempt to do some natural language processing to extract a bit of the essence of each job title from the row.

library(tidytext)
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
# We begin by adding some common stop words that do not have any value from our dataset into an extended stop word dictionary.

stop_words %>% 
  add_row(word= as.character(1:5), lexicon = "group_5") %>%
  add_row(word= c("i","I","ii","II","iii","III"), lexicon = "group_5") -> stop_words

# Then we tokenize the job titles in order to be able to remove the stop words and further process.

text_df <- tibble(line = 1:length(sal_US1$job_title), text = sal_US1$job_title) %>% 
  unnest_tokens(word,text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# We take the lemmatization of each word used in job title to better standardize the meaning behind the job titles 

text_df$word <- lemmatize_words(text_df$word)

# Below we find the words that occur at least 50 times in order to use those as our indicator instead of the raw job title. We keep only the lowest-count word for each row of the job title column in order to maximize variability for prediction.

text_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n >= 50) -> top_role

text_df %>% 
  inner_join(top_role, join_by(word == word)) %>% 
  arrange(n) %>% 
  distinct(line,.keep_all = TRUE) -> text_df

sal_US1 %>%
  mutate(Rownumber = row_number()) %>% 
  left_join(text_df, join_by(Rownumber == line )) %>%
  mutate(job_title = word) %>% 
  mutate(job_title = if_else(is.na(job_title), "other", job_title)) %>% 
  select(-Rownumber, -word, -n) -> sal_US1

Timestamp

Timestamp is just the datetime of the response so we will remove it.

#remove timestamp
sal_US1 = subset(sal_US1, select = -c(timestamp))

We now have 22,799 rows with 21 variables which we have either cleaned or created.

Outliers

To find outliers, we will use the scores() function from the outliers package, with type set to “iqr” and lim (the limit) set to 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.

We will create flags for each variable to identify potential outliers.

Using this method, we created:
- annual_salOUT: which shows we have 891 (4%) potential outliers in annual_sal
- total_compOUT: which shows 1,250 (5%) potential outliers in total_comp

The table below shows the resulting values (TRUE is a potential outlier).

#Outliers
#create new variable to flag potential outliers
sal_US$annual_salOUT<-scores(sal_US$annual_sal, type='iqr', lim=1.5)

sal_US$total_compOUT<-scores(sal_US$total_comp, type='iqr', lim=1.5)

#using mtabulate() from the qdapTools package to show a table with the values of all the OUT variables
t(mtabulate(sal_US[grep('OUT',names(sal_US))]))
##       annual_salOUT total_compOUT
## FALSE         21998         21639
## TRUE            891          1250
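
For reference, the 1.5 × IQR rule that scores() applies here is essentially the classic boxplot rule, which could be written by hand as in the sketch below (iqr_outlier is a hypothetical helper, not part of the outliers package).

#Minimal sketch of the 1.5*IQR rule, broadly equivalent to
#scores(x, type='iqr', lim=1.5); iqr_outlier is a hypothetical helper.
iqr_outlier <- function(x, lim = 1.5) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  x < (q1 - lim * iqr) | x > (q3 + lim * iqr)
}
#e.g. table(iqr_outlier(sal_US$annual_sal)) should roughly match the table above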

Correlation

Since most of the data is categorical, we can use a correlation funnel to look for correlation in our data.

# Correlation Funnel
sal_US1 %>%
  na.omit() |>
  binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) |> 
  correlate(hi_comp__1) |> 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)

Looking at the correlation for job titles, we can see that some job titles are slightly correlated with total compensation and high compensation; however, the vast majority of job titles have little to no correlation with either. Thus, we segment our job titles into those with the five highest magnitudes of correlation plus an “other” class. This will lessen the compute requirements on any models that we build while not losing much salient information.

sal_US1 %>% 
  mutate(job_title = if_else(job_title %in% c("engineer","software","administrative","director","librarian"), job_title, "other")) -> sal_US1

sal_US1 %>% 
  pull(job_title) %>% 
  unique()
## [1] "engineer"       "other"          "librarian"      "director"      
## [5] "software"       "administrative"

Build Models

With our data processed and explored we are ready to begin the exercise of building our models. We will build three distinct models: an elastic net model, an xgboost model, and a simple neural network model, each attempting to predict whether a given worker earns above or below the median compensation for their state. In practice this would allow workers to determine the factors that contribute to being paid above or below the median state income, and allow businesses to determine whether they should be paying a worker above or below the median state income based on their qualifications.

We’ll begin by splitting our processed data into train and test sets. While doing this split, we need to take into account that salaries with lower compensation are underrepresented in our dataset. If we leave the data as is and split based on the existing class distribution, then our models will be more likely to predict the higher-salary class simply because it is better represented. So, we will undersample the majority class by randomly discarding rows from it until the classes have an even representation.
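
Conceptually, undersampling just means keeping every minority-class row and an equally sized random sample of the majority class. The dplyr sketch below illustrates the idea (balance_by_undersampling is a hypothetical helper; the actual balancing further down uses UBL::RandUnderClassif).

#Illustrative manual undersampling: sample every class down to the size of the
#smallest class so both levels of hi_comp end up equally represented.
balance_by_undersampling <- function(df, class_col) {
  n_min <- min(table(df[[class_col]]))
  df %>%
    group_by(.data[[class_col]]) %>%
    slice_sample(n = n_min) %>%
    ungroup()
}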

Before splitting our data we also take the opportunity to relevel some of our categorical variables. We do this in order to give the elastic net regression equation a more sensible baseline rather than an arbitrary factor level. For example, for gender and education we set the baseline to the non-answers, so the coefficients for man, woman, and non-binary tell us how much each contributes toward having a higher income. We also set the reference level of the response to lower compensation so that the coefficients inform us about what leads to having a higher income rather than the other way around.

set.seed(1234)

sal_US2 <- sal_US1 %>% 
  mutate(
    hi_comp = relevel(as_factor(if_else(hi_comp == 1, "high","low")), ref = "low"),
    state = relevel(as_factor(state), ref="District of Columbia"),
    job_title = relevel(as_factor(job_title), ref="other"),
    edu = relevel(as_factor(edu), ref="Prefer not to answer"),
    gender = relevel(as_factor(gender), ref="Prefer not to answer"),
    indust_grp = relevel(as_factor(indust_grp), ref="Service")
  ) %>% 
  select(-c(annual_sal, PIncPerCap, total_comp, add_comp_flg))

train <- createDataPartition(sal_US2$hi_comp,
                             p = 0.7,
                             list = FALSE)

sal_train <- sal_US2[ train,] 
sal_test  <- sal_US2[-train,]

dim(sal_train)
## [1] 15961    18
sal_train_bal <- UBL::RandUnderClassif(hi_comp ~ ., sal_train)

sal_train_bal %>% 
  pull(hi_comp) %>% 
  table()
## .
##  low high 
## 5920 5920

Elastic Net Logit Regression

Our first model is an elastic net logistic regression model, which is essentially an extension of the logistic form of a generalized linear model. Elastic net adds the lambda and alpha hyperparameters for regularization and feature selection; it is a combination of lasso and ridge regression, where lambda controls the overall strength of the penalty and alpha controls the mix between the lasso penalty (which can drop terms entirely) and the ridge penalty (which shrinks coefficients toward zero). We utilize a grid search over six values each of alpha and lambda to fit the best elastic net model to our cross-validated and balanced training dataset.
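
Concretely, glmnet fits the penalized log-likelihood below, which makes the two roles explicit: lambda scales the whole penalty, while alpha mixes the lasso (L1) and ridge (L2) terms.

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right) \;+\; \lambda\left[\alpha\,\lVert\beta\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]$$

Setting alpha = 1 gives the lasso, alpha = 0 gives ridge, and lambda = 0 removes the penalty entirely (an ordinary logistic regression).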

However, we end up with a surprise: the optimal values amount to not using regularization at all. Our literature review did hint that we would not get much out of regularization without an extremely large number of quantitative variables, but our training ROC still improves over our earlier generalized linear model, with nearly even sensitivity and specificity.

We end up with a training ROC of 0.77 with a nice balance of sensitivity and specificity. Surprisingly, though, the optimal lambda hyperparameter is 0, which means no shrinkage of coefficients is applied at all; the selected alpha of 0.4 would weight the penalty 40% toward lasso and 60% toward ridge, but with lambda at 0 the penalty has no effect. Since no regularization is actually happening, our final model is effectively an ordinary logistic regression.

ctrl <- trainControl(
  method = "cv", 
  number = 5,
  summaryFunction = twoClassSummary, 
  classProbs = TRUE, 
  savePredictions = T
)

elm <- train(hi_comp ~ ., data = sal_train_bal, 
                          method = 'glmnet',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC",
                          tuneGrid=expand.grid(
                          .alpha=seq(0, 1, by = 0.2),
                          .lambda=seq(0, 0.1, by = 0.02)))

elm
## glmnet 
## 
## 11840 samples
##    17 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (97), scaled (97) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  ROC        Sens       Spec     
##   0.0    0.00    0.7705630  0.7028716  0.6871622
##   0.0    0.02    0.7703210  0.7040541  0.6875000
##   0.0    0.04    0.7694510  0.7011824  0.6888514
##   0.0    0.06    0.7684800  0.7008446  0.6888514
##   0.0    0.08    0.7675477  0.6988176  0.6900338
##   0.0    0.10    0.7665949  0.6974662  0.6896959
##   0.2    0.00    0.7707516  0.7042230  0.6875000
##   0.2    0.02    0.7696437  0.7084459  0.6815878
##   0.2    0.04    0.7655518  0.7015203  0.6824324
##   0.2    0.06    0.7615253  0.6896959  0.6856419
##   0.2    0.08    0.7564748  0.6819257  0.6866554
##   0.2    0.10    0.7509472  0.6785473  0.6820946
##   0.4    0.00    0.7709807  0.7037162  0.6871622
##   0.4    0.02    0.7664894  0.7054054  0.6798986
##   0.4    0.04    0.7584002  0.6910473  0.6831081
##   0.4    0.06    0.7480442  0.6834459  0.6790541
##   0.4    0.08    0.7378040  0.6707770  0.6728041
##   0.4    0.10    0.7258753  0.6695946  0.6567568
##   0.6    0.00    0.7707746  0.7037162  0.6873311
##   0.6    0.02    0.7632082  0.6971284  0.6817568
##   0.6    0.04    0.7484730  0.6873311  0.6738176
##   0.6    0.06    0.7322804  0.6748311  0.6641892
##   0.6    0.08    0.7109415  0.6711149  0.6489865
##   0.6    0.10    0.6932197  0.6917230  0.5974662
##   0.8    0.00    0.7707458  0.7040541  0.6878378
##   0.8    0.02    0.7591363  0.6962838  0.6761824
##   0.8    0.04    0.7379868  0.6797297  0.6685811
##   0.8    0.06    0.7120383  0.6753378  0.6432432
##   0.8    0.08    0.6871049  0.6935811  0.5881757
##   0.8    0.10    0.6736190  0.6630068  0.5998311
##   1.0    0.00    0.7706906  0.7042230  0.6878378
##   1.0    0.02    0.7543794  0.6939189  0.6733108
##   1.0    0.04    0.7263351  0.6800676  0.6528716
##   1.0    0.06    0.6929914  0.7005068  0.5849662
##   1.0    0.08    0.6736190  0.6630068  0.5998311
##   1.0    0.10    0.6632438  0.6630068  0.5998311
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.4 and lambda = 0.

Taking a look at the plot of ROC against alpha and lambda, we see that any attempt at regularization actually hurts the model, likely because the number and magnitude of our already centered and scaled variables do not benefit from shrinkage. All of our best ROCs occur when no regularization is happening at all, i.e. at a lambda of 0, and any differences in ROC between those fits are likely numerical noise from averaging across cross-validation folds.

plot(elm)

Looking at our variable importance and combining it with the model coefficients, we can see what matters most for being above the median income in your state. Years of experience in the field is far and away the biggest positive contributor, with having “software” somewhere in your job title trailing a little behind. Several states, such as California and Texas, also lend themselves to having a higher-than-median income. It should be noted that there are negative predictors, like working in the human services industry, that sharply lower the chance of being above the median income.

plot(varImp(elm), top=20)

We have the coefficients for the model in a dataframe below, which shows insights such as: those who identify as a man have 0.19 higher log odds of being above the median income for their state compared to those who do not disclose their gender. That is 0.25 log odds higher than those who identify as women and 0.34 log odds higher than those who identify as non-binary, a somewhat concerning sign with regard to gender equality.

lambda_use <- min(elm$finalModel$lambda[elm$finalModel$lambda >= elm$bestTune$lambda])
position <- which(elm$finalModel$lambda == lambda_use)
data.frame(coef(elm$finalModel)[, position])
##                                          coef.elm.finalModel....position.
## (Intercept)                                                  0.0272122856
## stateAlabama                                                 0.0342922241
## stateAlaska                                                  0.0025617741
## stateArizona                                                 0.0584637516
## stateArkansas                                                0.0378464302
## stateCalifornia                                              0.2703947400
## stateColorado                                               -0.0100258581
## stateConnecticut                                            -0.0007592963
## stateDelaware                                                0.0464858877
## stateFlorida                                                 0.0224573180
## stateGeorgia                                                 0.1449913864
## stateHawaii                                                  0.0262173568
## stateIdaho                                                   0.0524035885
## stateIllinois                                                0.1276316142
## stateIndiana                                                 0.0596260450
## stateIowa                                                    0.0133257667
## stateKansas                                                 -0.0277506708
## stateKentucky                                                0.0703749017
## stateLouisiana                                               0.0701187939
## stateMaine                                                  -0.0025375279
## stateMaryland                                                0.0684867472
## stateMassachusetts                                          -0.0058393672
## stateMichigan                                                0.0867681462
## stateMinnesota                                               0.0179198517
## stateMississippi                                             0.0088917197
## stateMissouri                                                0.0655163964
## stateMontana                                                -0.0161305880
## stateNebraska                                               -0.0421310194
## stateNevada                                                  0.0361507808
## stateNew Hampshire                                          -0.0382528976
## stateNew Jersey                                              0.0130526236
## stateNew Mexico                                              0.0593509182
## stateNew York                                                0.1396122567
## stateNorth Carolina                                          0.1030563539
## stateNorth Dakota                                           -0.0220905002
## stateOhio                                                    0.1012446442
## stateOklahoma                                               -0.0064275534
## stateOregon                                                  0.1697867279
## statePennsylvania                                            0.0753171886
## stateRhode Island                                            0.0171608265
## stateSouth Carolina                                          0.0613509087
## stateSouth Dakota                                           -0.0372967240
## stateTennessee                                               0.0277562600
## stateTexas                                                   0.1762924332
## stateUtah                                                    0.0821316509
## stateVermont                                                -0.0159693464
## stateVirginia                                                0.1193362667
## stateWashington                                              0.1105400661
## stateWest Virginia                                           0.0092381546
## stateWisconsin                                               0.0217328110
## stateWyoming                                                -0.0256990642
## age.L                                                        0.0308285558
## age.Q                                                       -0.0388525698
## age.C                                                        0.0287249806
## age^4                                                        0.1460006692
## age^5                                                       -0.0209193656
## age^6                                                        0.0510197707
## job_titleengineer                                            0.2660326673
## job_titlelibrarian                                          -0.1724378498
## job_titledirector                                            0.1159608471
## job_titlesoftware                                            0.4826346287
## job_titleadministrative                                     -0.2096524941
## yrs_wrk_exp.L                                               -0.0004006444
## yrs_wrk_exp.Q                                               -0.0480978159
## yrs_wrk_exp.C                                               -0.1049689487
## yrs_wrk_exp^4                                               -0.0069869400
## yrs_wrk_exp^5                                               -0.0475591711
## yrs_wrk_exp^6                                               -0.0100991098
## yrs_wrk_exp^7                                               -0.0019324193
## yrs_in_field_exp.L                                           0.5655205873
## yrs_in_field_exp.Q                                          -0.0870941441
## yrs_in_field_exp.C                                          -0.0226095680
## yrs_in_field_exp^4                                          -0.0180127128
## yrs_in_field_exp^5                                          -0.0056740656
## yrs_in_field_exp^6                                          -0.0171730190
## yrs_in_field_exp^7                                           0.0000000000
## eduCollege degree                                           -0.0265000749
## eduMaster's degree                                           0.1338599878
## eduProfessional degree (MD, JD, etc.)                        0.3258394885
## eduHigh School                                              -0.1442946244
## eduSome college                                             -0.2554382761
## eduPhD                                                       0.2243281920
## genderMan                                                    0.1887704368
## genderWoman                                                 -0.0587417266
## genderNon-binary                                            -0.1546615526
## indust_grpInformation Services                              -0.1108176735
## indust_grpManufacturing and Construction                     0.1500097958
## indust_grpHuman Services                                    -0.3336463870
## indust_grpRaw Materials                                     -0.0377259821
## title_context_flg1                                          -0.0981067782
## income_context_flg1                                          0.0222107868
## White                                                       -0.0867741771
## AsianOrAsianAm                                               0.0901848422
## NatAmOrAlaskaNat                                             0.0042052294
## BlackOrAficanAm                                              0.0052612611
## MiddleEorNAfrican                                            0.0069245255
## PreferNot                                                   -0.0080533652
## HispLatSp                                                   -0.0585251065
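
Since these are log-odds coefficients, exponentiating them gives odds ratios, which can be easier to communicate; a quick sketch reusing the position index computed above:

#Convert the gender log-odds coefficients to odds ratios; e.g. exp(0.19) ~ 1.21,
#so identifying as a man multiplies the odds of above-median income by about
#1.21 relative to the "Prefer not to answer" baseline.
elm_coefs <- coef(elm$finalModel)[, position]
round(exp(elm_coefs[c("genderMan", "genderWoman", "genderNon-binary")]), 2)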

Now we take a look at some evaluation metrics for our model and see that for our first model we get a fairly decent accuracy of 70.3%, and by balancing the dataset we do not have a big disparity between which classes are being predicted either. This sets a nice baseline accuracy that our other models will need to beat.

(prelm <- elm |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1826 1318
##       high  710 2984
##                                                
##                Accuracy : 0.7034               
##                  95% CI : (0.6924, 0.7142)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3943               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.7200               
##             Specificity : 0.6936               
##          Pos Pred Value : 0.5808               
##          Neg Pred Value : 0.8078               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2670               
##    Detection Prevalence : 0.4598               
##       Balanced Accuracy : 0.7068               
##                                                
##        'Positive' Class : low                  
## 
evalm <- as_tibble(rbind(c("Model"="Elastic Net Logit",prelm$overall["Accuracy"],prelm$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))

XGBoost Decision Tree Ensemble

Decision tree-based models (specifically ensembles) can traditionally end up beating a logistic regression on similar data when the logistic regression is not given the appropriate terms to work with. Thus, it made sense to train a tree-based model and compare it to our logistic regression model. XGBoost is an ensemble of decision trees that builds itself up gradually through repeated boosting iterations, with each new tree correcting the errors of the trees before it, and it is typically treated as the gold standard among tree-based models, which is why we chose it.

We utilize the same terms and preprocessing used for the previous model on the balanced data set, and we allow caret to automatically tune the xgboost model through its default parameter grid. If we chose to, we could instead customize the learning rate (eta), the number of boosting iterations (nrounds), the maximum depth of each tree (max_depth), the minimum child weight, and the row and column subsampling rates, as in the sketch below.
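
For illustration, such a custom grid could be passed to train() just as it was for the elastic net; the values below are hypothetical, not the default grid caret searched here.

#Illustrative custom tuning grid for method = 'xgbTree'; all seven parameters
#must be supplied even when some are held at a single value.
xgb_grid <- expand.grid(
  nrounds          = c(100, 150, 200), #boosting iterations
  max_depth        = c(2, 3, 4),       #maximum depth of each tree
  eta              = c(0.1, 0.3),      #learning rate
  gamma            = 0,                #minimum loss reduction to split
  colsample_bytree = c(0.6, 0.8),      #column subsampling per tree
  min_child_weight = 1,                #minimum sum of instance weights per leaf
  subsample        = c(0.75, 1)        #row subsampling per tree
)
#xgb <- train(hi_comp ~ ., data = sal_train_bal, method = 'xgbTree',
#             trControl = ctrl, metric = "ROC", tuneGrid = xgb_grid, verbosity = 0)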

# We reuse ctrl (5-fold CV with twoClassSummary and class probabilities) from
# the elastic net section so that ROC can again be used as the tuning metric.

xgb <- train(hi_comp ~ ., data = sal_train_bal, 
                          method = 'xgbTree',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC",
                          verbosity = 0)

xgb
## eXtreme Gradient Boosting 
## 
## 11840 samples
##    17 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (97), scaled (97) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  ROC        Sens     
##   0.3  1          0.6               0.50        50      0.7572538  0.6984797
##   0.3  1          0.6               0.50       100      0.7664798  0.7032095
##   0.3  1          0.6               0.50       150      0.7687499  0.7015203
##   0.3  1          0.6               0.75        50      0.7567168  0.6912162
##   0.3  1          0.6               0.75       100      0.7653855  0.7025338
##   0.3  1          0.6               0.75       150      0.7678028  0.7028716
##   0.3  1          0.6               1.00        50      0.7550485  0.6935811
##   0.3  1          0.6               1.00       100      0.7644497  0.7045608
##   0.3  1          0.6               1.00       150      0.7673589  0.7076014
##   0.3  1          0.8               0.50        50      0.7579428  0.6925676
##   0.3  1          0.8               0.50       100      0.7660825  0.7016892
##   0.3  1          0.8               0.50       150      0.7679894  0.7033784
##   0.3  1          0.8               0.75        50      0.7561491  0.6902027
##   0.3  1          0.8               0.75       100      0.7656428  0.7035473
##   0.3  1          0.8               0.75       150      0.7683460  0.7035473
##   0.3  1          0.8               1.00        50      0.7554777  0.6908784
##   0.3  1          0.8               1.00       100      0.7641892  0.7048986
##   0.3  1          0.8               1.00       150      0.7672695  0.7050676
##   0.3  2          0.6               0.50        50      0.7713045  0.6998311
##   0.3  2          0.6               0.50       100      0.7759413  0.7027027
##   0.3  2          0.6               0.50       150      0.7780119  0.7067568
##   0.3  2          0.6               0.75        50      0.7684107  0.6989865
##   0.3  2          0.6               0.75       100      0.7757136  0.7027027
##   0.3  2          0.6               0.75       150      0.7784199  0.7032095
##   0.3  2          0.6               1.00        50      0.7685914  0.6954392
##   0.3  2          0.6               1.00       100      0.7748471  0.7020270
##   0.3  2          0.6               1.00       150      0.7776191  0.7005068
##   0.3  2          0.8               0.50        50      0.7705557  0.6900338
##   0.3  2          0.8               0.50       100      0.7760780  0.7008446
##   0.3  2          0.8               0.50       150      0.7772938  0.7033784
##   0.3  2          0.8               0.75        50      0.7708489  0.7015203
##   0.3  2          0.8               0.75       100      0.7775871  0.7050676
##   0.3  2          0.8               0.75       150      0.7798438  0.7077703
##   0.3  2          0.8               1.00        50      0.7689553  0.6940878
##   0.3  2          0.8               1.00       100      0.7748025  0.6998311
##   0.3  2          0.8               1.00       150      0.7770959  0.7000000
##   0.3  3          0.6               0.50        50      0.7748033  0.6969595
##   0.3  3          0.6               0.50       100      0.7774157  0.6998311
##   0.3  3          0.6               0.50       150      0.7756935  0.6971284
##   0.3  3          0.6               0.75        50      0.7751786  0.6964527
##   0.3  3          0.6               0.75       100      0.7798733  0.6988176
##   0.3  3          0.6               0.75       150      0.7780683  0.6984797
##   0.3  3          0.6               1.00        50      0.7745554  0.6972973
##   0.3  3          0.6               1.00       100      0.7778087  0.7001689
##   0.3  3          0.6               1.00       150      0.7791723  0.7025338
##   0.3  3          0.8               0.50        50      0.7733890  0.6947635
##   0.3  3          0.8               0.50       100      0.7755043  0.7016892
##   0.3  3          0.8               0.50       150      0.7726986  0.6962838
##   0.3  3          0.8               0.75        50      0.7735473  0.6956081
##   0.3  3          0.8               0.75       100      0.7767860  0.6989865
##   0.3  3          0.8               0.75       150      0.7770633  0.6988176
##   0.3  3          0.8               1.00        50      0.7736951  0.6940878
##   0.3  3          0.8               1.00       100      0.7786515  0.6988176
##   0.3  3          0.8               1.00       150      0.7791159  0.7025338
##   0.4  1          0.6               0.50        50      0.7588348  0.6996622
##   0.4  1          0.6               0.50       100      0.7667285  0.7000000
##   0.4  1          0.6               0.50       150      0.7678259  0.7028716
##   0.4  1          0.6               0.75        50      0.7597993  0.6988176
##   0.4  1          0.6               0.75       100      0.7665021  0.7072635
##   0.4  1          0.6               0.75       150      0.7691458  0.7016892
##   0.4  1          0.6               1.00        50      0.7597044  0.6974662
##   0.4  1          0.6               1.00       100      0.7662533  0.7050676
##   0.4  1          0.6               1.00       150      0.7682276  0.7040541
##   0.4  1          0.8               0.50        50      0.7615122  0.7013514
##   0.4  1          0.8               0.50       100      0.7676843  0.7028716
##   0.4  1          0.8               0.50       150      0.7689627  0.7015203
##   0.4  1          0.8               0.75        50      0.7598068  0.7025338
##   0.4  1          0.8               0.75       100      0.7678722  0.7062500
##   0.4  1          0.8               0.75       150      0.7689141  0.7040541
##   0.4  1          0.8               1.00        50      0.7586600  0.6991554
##   0.4  1          0.8               1.00       100      0.7663634  0.7055743
##   0.4  1          0.8               1.00       150      0.7682460  0.7047297
##   0.4  2          0.6               0.50        50      0.7709094  0.6912162
##   0.4  2          0.6               0.50       100      0.7745954  0.7011824
##   0.4  2          0.6               0.50       150      0.7744919  0.6915541
##   0.4  2          0.6               0.75        50      0.7706695  0.6944257
##   0.4  2          0.6               0.75       100      0.7761942  0.6998311
##   0.4  2          0.6               0.75       150      0.7787126  0.7016892
##   0.4  2          0.6               1.00        50      0.7722029  0.6971284
##   0.4  2          0.6               1.00       100      0.7765141  0.7057432
##   0.4  2          0.6               1.00       150      0.7790404  0.7077703
##   0.4  2          0.8               0.50        50      0.7701142  0.6959459
##   0.4  2          0.8               0.50       100      0.7731955  0.7076014
##   0.4  2          0.8               0.50       150      0.7743636  0.7010135
##   0.4  2          0.8               0.75        50      0.7727503  0.7023649
##   0.4  2          0.8               0.75       100      0.7780400  0.7047297
##   0.4  2          0.8               0.75       150      0.7782493  0.7072635
##   0.4  2          0.8               1.00        50      0.7717029  0.6935811
##   0.4  2          0.8               1.00       100      0.7785561  0.7021959
##   0.4  2          0.8               1.00       150      0.7797763  0.7050676
##   0.4  3          0.6               0.50        50      0.7732348  0.7010135
##   0.4  3          0.6               0.50       100      0.7749693  0.7037162
##   0.4  3          0.6               0.50       150      0.7740255  0.7062500
##   0.4  3          0.6               0.75        50      0.7740169  0.6978041
##   0.4  3          0.6               0.75       100      0.7754728  0.6978041
##   0.4  3          0.6               0.75       150      0.7757586  0.6952703
##   0.4  3          0.6               1.00        50      0.7748298  0.6993243
##   0.4  3          0.6               1.00       100      0.7776663  0.6971284
##   0.4  3          0.6               1.00       150      0.7772504  0.7000000
##   0.4  3          0.8               0.50        50      0.7746342  0.6945946
##   0.4  3          0.8               0.50       100      0.7738472  0.6957770
##   0.4  3          0.8               0.50       150      0.7725517  0.6962838
##   0.4  3          0.8               0.75        50      0.7756496  0.6937500
##   0.4  3          0.8               0.75       100      0.7768373  0.7018581
##   0.4  3          0.8               0.75       150      0.7747284  0.7023649
##   0.4  3          0.8               1.00        50      0.7733082  0.6930743
##   0.4  3          0.8               1.00       100      0.7780589  0.6996622
##   0.4  3          0.8               1.00       150      0.7771866  0.7011824
##   Spec     
##   0.6736486
##   0.6802365
##   0.6804054
##   0.6760135
##   0.6812500
##   0.6831081
##   0.6726351
##   0.6766892
##   0.6788851
##   0.6777027
##   0.6773649
##   0.6800676
##   0.6809122
##   0.6783784
##   0.6817568
##   0.6756757
##   0.6775338
##   0.6802365
##   0.6923986
##   0.7042230
##   0.7045608
##   0.6895270
##   0.6981419
##   0.7005068
##   0.6971284
##   0.7016892
##   0.7065878
##   0.6979730
##   0.7045608
##   0.7064189
##   0.6942568
##   0.7059122
##   0.7025338
##   0.6927365
##   0.7033784
##   0.7060811
##   0.7052365
##   0.7028716
##   0.7043919
##   0.7064189
##   0.7109797
##   0.7087838
##   0.7030405
##   0.7010135
##   0.7016892
##   0.6993243
##   0.6952703
##   0.6978041
##   0.7005068
##   0.7042230
##   0.7033784
##   0.7035473
##   0.7048986
##   0.7052365
##   0.6760135
##   0.6831081
##   0.6890203
##   0.6785473
##   0.6805743
##   0.6837838
##   0.6782095
##   0.6765203
##   0.6822635
##   0.6792230
##   0.6858108
##   0.6859797
##   0.6748311
##   0.6778716
##   0.6853041
##   0.6766892
##   0.6798986
##   0.6810811
##   0.7042230
##   0.7038851
##   0.7060811
##   0.6961149
##   0.7023649
##   0.7077703
##   0.6940878
##   0.7047297
##   0.7027027
##   0.6971284
##   0.6937500
##   0.6976351
##   0.6964527
##   0.7030405
##   0.7005068
##   0.7016892
##   0.7070946
##   0.7045608
##   0.6996622
##   0.7016892
##   0.7013514
##   0.7045608
##   0.7055743
##   0.7006757
##   0.7016892
##   0.7065878
##   0.7043919
##   0.7052365
##   0.7045608
##   0.6994932
##   0.7076014
##   0.7018581
##   0.7025338
##   0.7033784
##   0.7015203
##   0.7013514
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3, eta
##  = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 0.75.

We were able to find the highest training ROC yet with the optimally tuned parameters from our search, 0.78, an improvement over the elastic net’s 0.77.
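
For reference, the best cross-validated ROC of each fit can be pulled straight from the caret train objects; a small sketch:

#Best cross-validated ROC for each model fit so far (~0.771 vs ~0.780 per the
#tables above).
c(elastic_net = max(elm$results$ROC, na.rm = TRUE),
  xgboost     = max(xgb$results$ROC, na.rm = TRUE))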

par(mfrow=c(2,2))
plot(xgb)

Visualizing how the ROC changes with the number of boosting rounds for each hyperparameter combination is an interesting exercise. For the selected combination the ROC peaks at 100 rounds and dips slightly at 150, whereas many of the shallower combinations keep improving through all 150 rounds.

plot(varImp(xgb), top=20)

Due to the nature of XGBoost we cannot see a coefficient for each term; there is no single equation to inspect. We can still examine variable importance, and we see some interesting changes in magnitude, though we cannot tell whether a variable contributes positively or negatively to the chance of being above the median income. The linear component of years in field experience is now almost four times as important as the next variable, working in the human services industry. Still, the XGBoost importance graph is very similar to the elastic net model’s, a sign that the information extracted from these variables is the most useful in the data that we have.

(x1c <- xgb |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1835 1265
##       high  701 3037
##                                                
##                Accuracy : 0.7125               
##                  95% CI : (0.7016, 0.7232)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4108               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.7236               
##             Specificity : 0.7060               
##          Pos Pred Value : 0.5919               
##          Neg Pred Value : 0.8125               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2684               
##    Detection Prevalence : 0.4533               
##       Balanced Accuracy : 0.7148               
##                                                
##        'Positive' Class : low                  
## 
evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="XGBoost",x1c$overall["Accuracy"],x1c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )

Now if we take a look at our predictions against the test set, we see an overall accuracy of 71.3%, a gain of about 0.9 percentage points over our previous model. The difference is not substantial, but it is an increase. Sensitivity and specificity remain nearly balanced, which is good for data that was originally unbalanced. This is our best model so far in everything but interpretability.

Neural Network

Our final model is a neural network built with the nnet package. It is a basic feed-forward network with a single hidden layer whose units use a logistic activation function; tuning anything larger would pose a problem from both a computational and a time standpoint. Instead we tune two parameters: size, the number of units in the single hidden layer, and decay, a regularization parameter that penalizes overly large weights. Our tuning grid runs from 1 to 10 for size and from 0.1 to 0.5 for decay, and we limit the maximum number of learning iterations to 250 due to time and computational constraints.

Even with these limitations, the network still takes more than two hours to finish tuning. Because of this we save the fitted model as an .rds file so it does not need to be refit every time the document is knit (see the fit-or-load sketch after the code below).

# As before, we reuse ctrl (5-fold CV with twoClassSummary) so that ROC remains
# the tuning metric.

grid <-  expand.grid(size = seq(from = 1, to = 10, by = 1),
                     decay = seq(from = 0.1, to = 0.5, by = 0.1))

nnet <- train(hi_comp ~ ., data = sal_train_bal, 
                          method = 'nnet',
                          tuneGrid = grid,
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = 'ROC',
                          maxit = 250,
                          verbose = FALSE,
                          trace = FALSE)
saveRDS(nnet, "nnet.rds")
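
To make that caching explicit, the training call could be wrapped in a fit-or-load check, a minimal sketch of the pattern using the same arguments as above:

#Fit-or-load pattern: only retrain the network if no cached copy exists on disk.
if (file.exists("nnet.rds")) {
  nnet <- readRDS("nnet.rds")
} else {
  nnet <- train(hi_comp ~ ., data = sal_train_bal, method = 'nnet',
                tuneGrid = grid, preProcess = c('center', 'scale'),
                trControl = ctrl, metric = 'ROC', maxit = 250, trace = FALSE)
  saveRDS(nnet, "nnet.rds")
}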

Taking a look at the summary of our model, we have a surprising result: the optimal number of hidden units is one, which means the model is essentially a single logistic function. In fact, the hyperparameter plot shows the network doing worse for each additional unit. The likely reason is our cap of 250 iterations: the more units in the layer, the more weights gradient descent has to adjust, so the network has less chance to converge within that budget. It is also likely that our data simply is not complex enough for multiple hidden units to help much.
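
To see why, assuming nnet’s logistic hidden and output units, a size-1 network predicts

$$\hat{p}(x) \;=\; \sigma\!\big(w_2\,\sigma(w_1^{\top}x + b_1) + b_2\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

which is a monotone transformation of the single linear score $w_1^{\top}x + b_1$; the decision rule $\hat{p}(x) > 0.5$ therefore reduces to a linear threshold on that score, the same family of decision boundaries a logistic regression produces.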

(nnet <- readRDS("nnet.rds"))
## Neural Network 
## 
## 11840 samples
##    17 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (97), scaled (97) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC        Sens       Spec     
##    1    0.1    0.7694272  0.6986486  0.6959459
##    1    0.2    0.7696311  0.7011824  0.6925676
##    1    0.3    0.7699531  0.7011824  0.6932432
##    1    0.4    0.7699417  0.7003378  0.6940878
##    1    0.5    0.7698871  0.7000000  0.6939189
##    2    0.1    0.7610215  0.6868243  0.6937500
##    2    0.2    0.7611409  0.6859797  0.7013514
##    2    0.3    0.7620972  0.6829392  0.7059122
##    2    0.4    0.7668394  0.7038851  0.6893581
##    2    0.5    0.7662787  0.6866554  0.7084459
##    3    0.1    0.7490994  0.6574324  0.6994932
##    3    0.2    0.7537780  0.6724662  0.7001689
##    3    0.3    0.7572072  0.6902027  0.6783784
##    3    0.4    0.7595605  0.6836149  0.6896959
##    3    0.5    0.7597257  0.6981419  0.6864865
##    4    0.1    0.7470650  0.6795608  0.6770270
##    4    0.2    0.7477484  0.6655405  0.6976351
##    4    0.3    0.7501731  0.6812500  0.6792230
##    4    0.4    0.7503127  0.6645270  0.6986486
##    4    0.5    0.7587360  0.6896959  0.6836149
##    5    0.1    0.7457530  0.6633446  0.6979730
##    5    0.2    0.7457224  0.6826014  0.6829392
##    5    0.3    0.7498817  0.6920608  0.6640203
##    5    0.4    0.7477496  0.6809122  0.6782095
##    5    0.5    0.7535584  0.6903716  0.6785473
##    6    0.1    0.7418939  0.6530405  0.6957770
##    6    0.2    0.7411864  0.6597973  0.6861486
##    6    0.3    0.7449520  0.6672297  0.6858108
##    6    0.4    0.7433482  0.6731419  0.6826014
##    6    0.5    0.7499916  0.6761824  0.6900338
##    7    0.1    0.7376782  0.6726351  0.6738176
##    7    0.2    0.7370763  0.6722973  0.6716216
##    7    0.3    0.7400095  0.6689189  0.6809122
##    7    0.4    0.7426851  0.6741554  0.6795608
##    7    0.5    0.7449045  0.6795608  0.6721284
##    8    0.1    0.7373847  0.6689189  0.6738176
##    8    0.2    0.7386485  0.6687500  0.6798986
##    8    0.3    0.7365913  0.6704392  0.6692568
##    8    0.4    0.7378873  0.6719595  0.6778716
##    8    0.5    0.7370564  0.6712838  0.6709459
##    9    0.1    0.7338159  0.6635135  0.6771959
##    9    0.2    0.7291759  0.6658784  0.6670608
##    9    0.3    0.7345225  0.6641892  0.6771959
##    9    0.4    0.7359239  0.6646959  0.6675676
##    9    0.5    0.7371851  0.6707770  0.6748311
##   10    0.1    0.7280691  0.6643581  0.6557432
##   10    0.2    0.7309040  0.6655405  0.6677365
##   10    0.3    0.7261672  0.6515203  0.6714527
##   10    0.4    0.7291849  0.6672297  0.6648649
##   10    0.5    0.7325745  0.6646959  0.6748311
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.3.
plot(nnet)

Looking at the importance graph, we see that the neural network gives even more weight to having “software” in your job title as a contributor to being above or below the median income, and having a professional degree now makes its way toward the top of the chart. These are likely the inputs with the largest weights in the network; however, just like with the XGBoost model, we cannot determine the direction of the effect. It could even be that having “software” in your job title lowers the probability of being above the median state income.

plot(varImp(nnet), top=20)

Now, perhaps the most surprising result in this report: the neural network, which took the longest to tune and had the most complex configuration, performs only 0.05 percentage points better than our elastic net model and is actually worse than our “simpler” XGBoost model. Just because a model is complex and popular does not mean it will beat another properly fit and tuned model hands down.

(nn <- nnet |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1816 1305
##       high  720 2997
##                                                
##                Accuracy : 0.7039               
##                  95% CI : (0.6929, 0.7147)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3941               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.7161               
##             Specificity : 0.6967               
##          Pos Pred Value : 0.5819               
##          Neg Pred Value : 0.8063               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2656               
##    Detection Prevalence : 0.4564               
##       Balanced Accuracy : 0.7064               
##                                                
##        'Positive' Class : low                  
## 
evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="Neural Network",nn$overall["Accuracy"],nn$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )

Conclusion

With our classification models built, we want to see which one has the best performance for our purposes. Recall that we proposed two goals at the outset of this exercise: the best predictive power from the standpoint of a business that is hiring, and an interpretable model from the standpoint of a person looking for a job who wants to know what would get them paid more than the state median. We can evaluate the first criterion by looking at accuracy, while also considering sensitivity and specificity since the dataset was initially unbalanced. When comparing the models in this way, XGBoost clearly comes out on top, but the elastic net logit is less than one percentage point of accuracy behind.

If the XGBoost model were in the realm of 90% accuracy while the elastic net logit model was stuck around 70%, we would certainly pick the XGBoost model for the business’s goal of deciding how much to pay an employee. However, when a difference of less than one percentage point of accuracy is also the difference between full interpretability and much more limited interpretability, it makes sense to choose the more interpretable model for both of our use cases.

evalm %>% 
  mutate(
    across(
      c(Accuracy:`Balanced Accuracy`),
      ~round(as.numeric(.x)
             , 4)
    )
  ) %>% 
  arrange(desc(`Accuracy`)) %>% 
  knitr::kable()
Model Accuracy Sensitivity Specificity Balanced Accuracy
XGBoost 0.7125 0.7236 0.7060 0.7148
Neural Network 0.7039 0.7161 0.6967 0.7064
Elastic Net Logit 0.7034 0.7200 0.6936 0.7068

Using the logit model, whose coefficients are outlined above, a business could justify paying someone more because the model suggests that more years of experience in the field increases the log odds of being paid above the median; the linear component of years in field experience has a coefficient of about 0.57, the largest in the model. Alternatively, a prospective employee could see that holding a PhD raises the log odds of being paid above the median state income by about 0.22 compared to not disclosing education, and decide to pursue further education.

70% accuracy with a balance between sensitivity and specificity is reasonably accurate for making these predictions. However, we would ideally like to eke out as much accuracy as possible. The fact that all of our models settled around the same accuracy suggests this is close to the best we can do with the data in its current form; the limitation likely lies in the dataset itself rather than in model choice. There were not enough fields that could reliably contribute to predicting above- or below-median income. Perhaps if there were a question asking respondents how they believed their income compared to their peers, we could make much more accurate predictions. We could also probably have extracted more value from the messier columns such as job titles and the context fields, though the process would be painstaking.

A new survey whose questions only allow pre-defined responses, designed to extract the most predictive power from each question, would be the ideal way to extend this attempt at predicting incomes and income classification. Alternatively, a multi-layered neural network fed the original text might be able to do its own feature selection on the text-heavy job title and context fields and produce the best predictions.