Introduction

Our goal is to use job salary data (retrieved from the Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data to predict whether the salaries in the survey end up above or below the per capita personal income for their state. We originally stated that we would get the state income data from the Census, but found that it only contained median household income. Therefore we are using personal income per capita for 2022, per FRED (Federal Reserve Economic Data: https://fred.stlouisfed.org/release/tables?eid=257197&rid=110).

Being a real-world survey dataset, the data is a bit messy, so this will take a fair amount of data transformation and cleaning. We also want to take two approaches for our predictions: a logistic (classification) model for predicting whether a person ends up above or below the state per capita personal income, and a separate quantitative (regression) model for predicting income itself.

Loading Data

Here we load:

  • the Salary Survey data
  • the Personal Income Per Capita by State

Column Names

Before we do anything else, we want to rename the columns in the Salary Survey Data. As is typical of survey data, the column names are the full text of the questions, and as such are long and unwieldy. It would be cumbersome to do anything in this form.
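As a minimal sketch of this renaming step (the data frame names sal_raw and sal are assumptions), the long question text can simply be swapped for the short names listed in the Description of Variables table below:

# Swap the long question-text column names for short, workable names.
# Assumes the 18 columns arrive in the order shown in the variable table below.
sal <- sal_raw
names(sal) <- c("timestamp", "age", "industry", "job_title", "title_context",
                "annual_sal", "add_comp", "ccy", "other_ccy", "income_context",
                "ctry", "state", "city", "yrs_wrk_exp", "yrs_in_field_exp",
                "edu", "gender", "race")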

Description of Variables

Below is a table showing the variables in the Salary Survey data set:

Variable            Survey Question
timestamp Timestamp on the survey response
age How old are you?
industry What industry do you work in?
job_title Job title
title_context If your job title needs additional context, please clarify here
annual_sal What is your annual salary? (You’ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent – what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
add_comp How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
ccy Please indicate the currency
other_ccy If “Other,” please indicate the currency here
income_context If your income needs additional context, please provide it here
ctry What country do you work in?
state If you’re in the U.S., what state do you work in?
city What city do you work in?
yrs_wrk_exp How many years of professional work experience do you have overall?
yrs_in_field_exp How many years of professional work experience do you have in your field?
edu What is your highest level of education completed?
gender What is your gender?
race What is your race? (Choose all that apply.)

And a table showing the variables in the Personal Income Per Capita by State:

Variable            Description
state The name of the US State
PIncPerCap The personal income per capita for that state, for 2022, per FRED (Federal Reserve Economic Data)

Data Exploration

Let’s take a glimpse at the Salary Survey Data. There are 27,992 rows and 18 columns.

## Rows: 27,992
## Columns: 18
## $ timestamp        <chr> "4/27/2021 11:02:09", "4/27/2021 11:02:21", "4/27/202…
## $ age              <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34",…
## $ industry         <chr> "Education (Higher Education)", "Computing or Tech", …
## $ job_title        <chr> "Research and Instruction Librarian", "Change & Inter…
## $ title_context    <chr> "", "", "", "", "", "", "", "High school, FT", "Data …
## $ annual_sal       <int> 55000, 54600, 34000, 62000, 60000, 62000, 33000, 5000…
## $ add_comp         <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0…
## $ ccy              <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "USD…
## $ other_ccy        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ income_context   <chr> "", "", "", "", "", "", "", "", "", "I work for a Cha…
## $ ctry             <chr> "United States", "United Kingdom", "US", "USA", "US",…
## $ state            <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "South…
## $ city             <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "G…
## $ yrs_wrk_exp      <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 y…
## $ yrs_in_field_exp <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years",…
## $ edu              <chr> "Master's degree", "College degree", "College degree"…
## $ gender           <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "Ma…
## $ race             <chr> "White", "White", "White", "White", "White", "White",…

Data Preprocessing

Before beginning our exploration, we need to take some steps to prepare the data to be modeled. We will:

  • convert the empty strings to NA
  • make the category fields factors and order the values
  • limit to responses from the US
  • make sure state is in the form needed to merge with the personal income per capita from the state_inc data
  • merge the state_inc data
  • sum the annual_sal and the add_comp to get a total_comp
  • create a column that indicates if total_comp is above or below the state per capita income

Once these steps are complete, we can explore our data and do any data preparation that might be advantageous in our modeling.

Empty Strings to NA

Many columns have empty strings that should be NA.
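One way this might be done with dplyr (a sketch, assuming the renamed survey data frame is called sal):

library(dplyr)

# Replace empty strings in every character column with NA
sal <- sal %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))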

Factors and Order

These columns are category fields to be made into factors: age, yrs_wrk_exp, yrs_in_field_exp, edu and gender.

## $age
## [1] "25-34"      "45-54"      "35-44"      "18-24"      "65 or over"
## [6] "55-64"      "under 18"  
## 
## $yrs_wrk_exp
## [1] "5-7 years"        "8 - 10 years"     "2 - 4 years"      "21 - 30 years"   
## [5] "11 - 20 years"    "1 year or less"   "41 years or more" "31 - 40 years"   
## 
## $yrs_in_field_exp
## [1] "5-7 years"        "2 - 4 years"      "21 - 30 years"    "11 - 20 years"   
## [5] "1 year or less"   "8 - 10 years"     "31 - 40 years"    "41 years or more"
## 
## $edu
## [1] "Master's degree"                    "College degree"                    
## [3] "PhD"                                "Some college"                      
## [5] "High School"                        "Professional degree (MD, JD, etc.)"
## 
## $gender
## [1] "Woman"                         "Non-binary"                   
## [3] "Man"                           "Other or prefer not to answer"
## [5] "Prefer not to answer"

Looking at these values, it is clear that all but gender can be ordered.
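For example, age might be converted like this (a sketch; the other ordered variables would follow the same pattern, and gender would stay unordered):

# Order the age bands from youngest to oldest
sal$age <- factor(sal$age,
                  levels = c("under 18", "18-24", "25-34", "35-44",
                             "45-54", "55-64", "65 or over"),
                  ordered = TRUE)

# gender has no natural order, so it stays an unordered factor
sal$gender <- factor(sal$gender)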

NOTE TO TEAM: When we model, we may find that we do not want this in ‘natural’ order but in order of most to least common, or most to least likely to have salary > the state per capita income. TBD

Limiting to US only

We will remove any rows not in the US. Part of the messiness of this data is the variety of ways a single country is written: United States, US, USA, etc. We keep it simple and eliminate rows where state=NA. Based on city, 171 of the rows where state=NA appear to be from the US, and while in some cases the state might be derived from the city, in others it is not clear. Since eliminating the rows without a state still leaves us with almost 23K rows, this seems the best solution.
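A sketch of this filter (the data frame names sal and sal_US are assumptions):

library(dplyr)

# Keep only rows with a US state; ctry is too inconsistently coded to filter on
sal_US <- sal %>%
  filter(!is.na(state)) %>%
  select(-ctry)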

After removing all the non-US rows, we have 22,995 rows and 17 columns. We removed the ctry column, as it no longer adds value.

State

Taking a look at state, there are 106 cases where multiple states are selected. This is small compared to the total number of rows. We eliminate them and still have almost 23K rows (22,889).

## [1] "Mississippi, Missouri"                                                        
## [2] "California, Maryland"                                                         
## [3] "California, Illinois, Massachusetts, North Carolina, South Carolina, Virginia"
## [4] "Alabama, California"                                                          
## [5] "Michigan, Texas, Washington"                                                  
## [6] "Alabama, Oregon"
## [1] 106
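Since the multi-state responses are comma-separated, they can be dropped with a simple pattern match (a sketch):

library(dplyr)
library(stringr)

# Keep only single-state responses
sal_US <- sal_US %>%
  filter(!str_detect(state, ","))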

Merge Data

Now that the state column is clean, we can merge the two data frames, bringing in the per capita income by state to each row.

##        state PIncPerCap
## 1    Alabama      50916
## 2     Alaska      68635
## 3    Arizona      58442
## 4   Arkansas      52618
## 5 California      77036
## 6   Colorado      75722
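The merge itself might look like this (a sketch, assuming the per capita data frame is called state_inc, with the state and PIncPerCap columns shown above):

library(dplyr)

# Attach the state per capita income to each survey row
sal_US <- sal_US %>%
  left_join(state_inc, by = "state")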

Target Column for High Compensation

We want to create the target column indicating if the salary is above or below the Personal Income Per Capita. The steps to do this are:
- sum the annual_sal and add_comp fields to make a total_comp field
- create a column (hi_comp) that compares total_comp and PIncPerCap, set to 1 if total_comp is greater than or equal to PIncPerCap and 0 if it is less
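A sketch of these two steps (treating a missing add_comp as zero when summing, which matches the totals shown below):

library(dplyr)

sal_US <- sal_US %>%
  mutate(total_comp = annual_sal + coalesce(add_comp, 0),
         hi_comp    = if_else(total_comp >= PIncPerCap, 1, 0))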

##   annual_sal add_comp total_comp PIncPerCap hi_comp
## 1     120000       NA     120000      50916       1
## 2      70000       NA      70000      50916       1
## 3     120000    15000     135000      50916       1
## 4     175000       NA     175000      50916       1
## 5      40000        0      40000      50916       0
## 6      22800       NA      22800      50916       0

Data Skimming

Now we have the data we want to model. Let’s skim for a quick view of summary statistics, and a visualization of distributions with mini histograms.

After all our pre-processing, we have 22,889 responses (rows) with 20 variables.

We notice here that there are many missing values:

  • 16,905 rows are missing additional context for the title (title_context)
  • 22,851 missing values for other currency (other_ccy)
  • 20,408 missing values for additional context for income (income_context)
  • 5,720 missing values for additional compensation (add_comp)
  • a small number of rows missing city (12), education level (edu, 138), gender (130), industry (56), and race (130)

We also see that the numeric data is mostly skewed right, except for PIncPerCap, which is rather normal.

Data summary
Name sal_US
Number of rows 22889
Number of columns 20
_______________________
Column type frequency:
character 10
factor 6
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 20 0 51 0
timestamp 0 1.00 16 19 0 20899 0
industry 56 1.00 2 171 0 1022 0
job_title 0 1.00 1 126 0 11916 0
title_context 16905 0.26 1 808 0 5771 10
ccy 0 1.00 3 7 0 8 0
other_ccy 22851 0.00 1 100 0 24 0
income_context 20408 0.11 1 939 0 2435 2
city 12 1.00 1 84 0 3710 6
race 130 0.99 5 125 0 45 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age 0 1.00 TRUE 7 25-: 10180, 35-: 8148, 45-: 2646, 18-: 960
yrs_wrk_exp 0 1.00 TRUE 8 11 : 7978, 8 -: 4371, 5-7: 3924, 21 : 3001
yrs_in_field_exp 0 1.00 TRUE 8 11 : 5401, 5-7: 5363, 2 -: 4944, 8 -: 4130
edu 138 0.99 TRUE 6 Col: 11148, Mas: 7363, Som: 1612, Pro: 1119
gender 130 0.99 FALSE 5 Wom: 17755, Man: 4187, Non: 590, Oth: 226
hi_comp 0 1.00 FALSE 2 1: 14389, 0: 8500

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
annual_sal 0 1.00 92025.95 75165.15 0 56160 78000 112000 5000044 ▇▁▁▁▁
add_comp 5720 0.75 12700.58 46465.79 0 0 2000 10000 1500000 ▇▁▁▁▁
PIncPerCap 0 1.00 69626.18 10301.27 46370 61475 68840 77036 95970 ▁▇▇▅▁
total_comp 0 1.00 101552.64 93505.36 0 58000 81000 120000 5004044 ▇▁▁▁▁

Missing Values

Let’s look at how we might address the missing values.

industry (56)

Industry looks like it was a free text field, and there are over 1,000 different values. Not super helpful. A bit of research reveals a grouping of industry sectors as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)

We will attempt to group all the industries into one of these groups in a new column - indust_grp.

A sample of the industry values:

## [1] "Computing or Tech"                       
## [2] "Federal Contracting/Business Development"
## [3] "Government and Public Administration"    
## [4] "Law"                                     
## [5] "STEM medical research"                   
## [6] "Education (Higher Education)"

Showing the first few job_title values where industry is NA:

## [1] "Adult Services Librarian"             
## [2] "Cybersecurity senior risk specialist "
## [3] "Senior Manager of Ops Strategy"       
## [4] "UX Designer"                          
## [5] "Lab Assistant (Level II)"             
## [6] "People Operations Manager "

title_context (16,905)

This field provides further explanation when the job_title requires it. Again, this is a free text field, and though most values are NA, there are still almost 6K responses, likely to be unique.

Our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the job title (1) or not (0). One thing to consider is that some responses seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case).

## [1] NA                                                                  
## [2] "Senior associate"                                                  
## [3] "Mid- level professional biotechnology laboratory researcher "      
## [4] "na"                                                                
## [5] "Make graphics and videos, write captions post them on social media"
## [6] "Commercial design (not residential)"

other_ccy (22,851)

As we eliminated non-US rows, we would expect ccy to be all USD, and therefore other_ccy to be all NAs. Instead, there are 13 rows where other_ccy is populated. Looking at the values in ccy, we find the assumption that it would all be USD is wrong, as 26 rows have another ccy.

## 
## AUD/NZD     CAD     EUR     GBP     JPY   Other     SEK     USD 
##       1       5       4       1       1      13       1   22863

Looking at what is in other_ccy, we find that there are a few USD or variations (US Dollar, American Dollars), a few that are another currency or equity/stock. But most seem to be answering a different question.

Since the comparison to the state personal income per capita requires that the total_comp be in USD, we will ‘correct’ the 3 types of ‘others’ that are clearly USD, and then eliminate the other rows that are not USD (or not identifiable as such).

## [1] "Na "                                                 
## [2] "My bonus is based on performance up to 10% of salary"
## [3] "Equity"                                              
## [4] "N/a"                                                 
## [5] "Stock "                                              
## [6] "American Dollars"

income_context (20,408)

Similar to title_context, the purpose of this variable is to provide further explanation if the annual_sal, add_comp or ccy fields require it. Again, this is a free text field, and though most values are NA, there are still almost 2.5K responses, likely to be unique.

As with title_context, our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the income (1) or not (0). Here again, some responses seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case).

## [1] NA                                                                                                                         
## [2] "Need to Pay Fees"                                                                                                         
## [3] "na"                                                                                                                       
## [4] "24.2/hr"                                                                                                                  
## [5] "Clinical pharmacist in a small hospital"                                                                                  
## [6] "My bonuses are based on my fees in cases and it’s never a consistent amount. Ranges from $10,000-$50,000 per year so far."

city (12)

We notice that in addition to the 12 actual NAs, there are 32 rows with “na”s and another 92 with some variation of “prefer not to answer”.

We are not sure how city might affect the model. Some cities might have higher salaries, but since this is again free text, there are over 2K unique responses.

Since there are almost as many unique responses as there are rows, and we cannot think of a way to group these, we will eliminate this column.

## [1] "Birmingham"  "Huntsville"  "Birmingham " "Auburn"      "Athens"     
## [6] "na"

Demographic variables: race, edu, gender

There are very similar counts of NAs for race, edu and gender (130, 138 and 130, respectively).

There are about 20 or so responses where all three of these are blank. It makes sense to remove those rows.

For the rest:

  • gender has both “Other or prefer not to answer” and “Prefer not to answer” as well as NAs. We will combine those into a single “Prefer not to answer”.
  • similarly, we will replace the NAs in edu with “Prefer not to answer”.
  • race will take a bit more work, as it is a check-all-that-apply field. So in addition to replacing the NAs with “Prefer not to answer”, we also have to split out the field into flags for which of the 7 options provided were chosen (one of which is “Another option not listed here or prefer not to answer” - so the NAs will go there).

Note that there are 6 options shown below, but there is also a 7th: ‘Hispanic, Latino, or Spanish origin’, which only appears along with other options selected.

## [1] "White"                                                 
## [2] "Asian or Asian American"                               
## [3] "Another option not listed here or prefer not to answer"
## [4] "Native American or Alaska Native"                      
## [5] "Black or African American"                             
## [6] "Middle Eastern or Northern African"                    
## [7] NA

We now have a plan for what data preparation is needed for missing values. Let’s explore other aspects of the data.

Variable Distribution

The histograms in skim are small and some detail is lost, so let's take a more detailed look at some of the continuous variables.

In our first look, the visualization of the distributions is dominated by the very long tails.

The 75th percentile for annual_sal is less than $150,000, while the max is over $5,000,000. There is a very small number of rows (fewer than 20) where either annual_sal or add_comp is greater than 1 million. In order to see what the majority (first 75%) of the data looks like, we will filter annual_sal, add_comp and total_comp to values below $150,000.

When we do this, we can see that the large majority of the data is pretty normal. The right skewness for annual_sal, add_comp and total_comp comes from the small number of very high values.

PIncPerCap looks multimodal.

Target Flag Breakdown

We want to take a look into what contributes to having a total_comp greater than the PIncPerCap. We look at the breakdown of whose total_comp does and does not exceed it, by the qualitative variables that we have.

This shows that the percent of respondents with hi_comp=1 (total_comp is greater than or equal to PIncPerCap):

  • generally increases with age
  • increases with both yrs_wrk_exp and yrs_in_field_exp until about 30 years then decreases.
  • increases with education levels
  • gender is a bit all over, but needs some clean up to combine NA, and the two variations of “Prefer not to answer”

Target Amount Breakdown

NOTE TO TEAM: I wanted to try this, but all it really shows is that if total_comp is higher, one is more likely to have hi_comp=1. Not really a revelation. Maybe we take this out?

Again here, we look at the lower three quartiles (total_comp < $150,000).

Maps

To understand how the Personal Income Per Capita varies by state, we can plot it on a map.

Because DC is so small on the map, it is hard to see that it has the highest per capita personal income, $95,970. But you can see that it is followed by Massachusetts (MA) at $84,561 and Connecticut (CT) at $82,938. Of course, the per capita calculation will be greatly impacted by population, especially for the highest population states (California, Texas, Florida and New York) and lowest population states (Wyoming, Vermont, DC, and Alaska).

We are also interested to see how many responses there are in each state. To do this, we create a small table with just the state and the count of rows with that state in the sal_US data.

California (CA) with 2,607 and New York (NY) with 2,168 are the top two. Wyoming (WY) with 24 and South Dakota (SD) with 25 are at the bottom. Every state (+ DC) is represented.

For our third map, we are interested in the mean of the total_comp reported by state. As we did above, we will create a small table to capture the data, then map it.

Top 3 are California, $135,362.24 (CA), Washington, $127,442.40 (WA) and New York, $117,769.82 (NY). At the other end of the spectrum are Wyoming, $60,475.96 (WY), Montana, $63,200.66 (MT) and Mississippi, $63,929.22 (MS).


Data Preparation

Now that we have taken a good look at our data, we can begin to prepare it to be useful in modeling, both to predict when total_comp is greater than personal income per capita and to predict the actual total_comp.

In order to be able to compare to our (pre-processed but still) baseline data, we will create a new data frame, sal_US1.

Missing Values

Some missing values will actually be addressed by creating new categories, so those will be discussed in the following section.

ccy

To make valid comparisons, we want all responses to be reported in USD. As described above, there are a few rows with ccy = ‘Other’ where the value in other_ccy makes it clear it is USD. We will update those and then delete the 23 non-USD rows. Once that is complete, we no longer need ccy or other_ccy, so we will remove those columns.
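A sketch of this cleanup (the exact USD spellings in usd_variants are illustrative and would need to match the write-ins found in other_ccy):

library(dplyr)
library(stringr)

usd_variants <- c("USD", "US Dollar", "American Dollars")  # hypothetical exact values

sal_US1 <- sal_US1 %>%
  # Recode 'Other' rows whose other_ccy is clearly USD
  mutate(ccy = if_else(ccy == "Other" & str_trim(other_ccy) %in% usd_variants,
                       "USD", ccy)) %>%
  # Drop the remaining non-USD rows, then drop the currency columns
  filter(ccy == "USD") %>%
  select(-ccy, -other_ccy)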

city

Recall that city is free text, with almost as many unique values as rows, and no good way to categorize it. We do not need to deal with the NAs, as we will remove this column.

New Categorical Columns

Industry Group

As described above, we will group industry into 5 categories as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)

This will take some iterative effort to identify the group for each of the many varied values of the free text industry field.
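A sketch of how this grouping might be started with case_when() and keyword matching (the keyword lists here are illustrative, not the full set of rules we ended up needing):

library(dplyr)
library(stringr)

sal_US1 <- sal_US1 %>%
  mutate(ind_lower = str_to_lower(industry),
         indust_grp = case_when(
           str_detect(ind_lower, "farm|logging|fishing|forestry|mining")  ~ "Raw Materials",
           str_detect(ind_lower, "manufactur|construction")               ~ "Manufacturing and Construction",
           str_detect(ind_lower, "tech|media|education|consult|research") ~ "Information Services",
           str_detect(ind_lower, "government|nonprofit|ngo")              ~ "Human Services",
           is.na(industry)                                                ~ NA_character_,
           TRUE                                                           ~ "Service"
         )) %>%
  select(-ind_lower)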

Once complete, we can see that Information Services has the largest number of responses, and very few were from the Raw Materials Industry Group.

## 
##                 Human Services           Information Services 
##                           4166                           9083 
## Manufacturing and Construction                  Raw Materials 
##                           1876                            169 
##                        Service 
##                           7523

There are 58 rows left with NA in industry and therefore in indust_grp. We will remove those rows, and then remove industry, leaving just the indust_grp column.

We now have 22,817 rows and 17 variables.

Context fields

We will turn both context fields, title_context and income_context, into flags (0,1) indicating whether there is additional context. The values in these fields are too varied to be of any use in categorizing any other way, but it is possible a response where it was necessary to add context might impact total_comp one way or another. We will then remove the original fields, leaving the same number of variables, 17.
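A sketch of the flag creation (the list of “na”-style write-ins treated as no-context is an assumption):

library(dplyr)
library(stringr)

na_like <- c("na", "n/a", "none")  # hypothetical write-ins meaning "nothing to add"

sal_US1 <- sal_US1 %>%
  mutate(title_context_flg  = if_else(!is.na(title_context) &
                                        !str_to_lower(str_trim(title_context)) %in% na_like, 1, 0),
         income_context_flg = if_else(!is.na(income_context) &
                                        !str_to_lower(str_trim(income_context)) %in% na_like, 1, 0)) %>%
  select(-title_context, -income_context)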

As we can see, we have 5,974 rows with title_context_flg=1

## 
##     0     1 
## 16843  5974

And 2,475 rows with income_context_flg=1

## 
##     0     1 
## 20342  2475

Demographic Columns

There are a few rows where race, edu and gender are all NA. These will be removed, leaving 22,799 rows.

Next, we want to combine “Other or prefer not..” + “Prefer not..” + NAs in gender, leaving no NAs and a more organized set of values.

## [1] "Man"                  "Woman"                "Non-binary"          
## [4] "Prefer not to answer"

Similarly, we want to combine “Prefer not..” + NAs in edu, leaving no NAs and a more organized set of values.

## [1] "College degree"                     "Master's degree"                   
## [3] "Professional degree (MD, JD, etc.)" "High School"                       
## [5] "Some college"                       "PhD"                               
## [7] "Prefer not to answer"

For race, we combine “Another option not listed here or prefer not to answer” + NAs. Since this is a pick-all-that-apply field, we then must make a set of flags to indicate whether each of the 7 choices was or was not selected. Once those flags are created, we remove the original multiselect race column.
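A sketch of the race split (the flag names White and HispLatSp match those that appear in the models later; the remaining names are assumptions):

library(dplyr)
library(stringr)

sal_US1 <- sal_US1 %>%
  mutate(race = if_else(is.na(race),
                        "Another option not listed here or prefer not to answer", race),
         White     = as.integer(str_detect(race, "White")),
         Black     = as.integer(str_detect(race, "Black or African American")),
         Asian     = as.integer(str_detect(race, "Asian or Asian American")),
         NativeAm  = as.integer(str_detect(race, "Native American or Alaska Native")),
         MidEastNA = as.integer(str_detect(race, "Middle Eastern or Northern African")),
         HispLatSp = as.integer(str_detect(race, "Hispanic, Latino, or Spanish origin")),
         PreferNot = as.integer(str_detect(race, "Another option"))) %>%
  select(-race)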

Since we were not able to look at these fields in the Target Flag Breakdown section of the Data Exploration, we will do so here.

add_comp

Much of add_comp is NA, as not everyone gets compensation beyond salary. We have captured the added amount in total_comp (annual_sal + add_comp). But it may be of interest whether add_comp is received or not. So, here again, we create a flag to indicate if add_comp was populated (1) or not (0). Then we remove the original field.

As we can see, we have 17,110 rows with add_comp_flg=1

## 
##     0     1 
##  5689 17110

Job Title

This is likely one of the most powerful variables we have for predicting our response, as it is widely known that the type of job you have is directly related to how much income you earn. The problem with using this field in regression models is that, right now, it is only a character column with a multitude of different responses that cannot reasonably be turned into a factor as is.

Thus, we will attempt to do some natural language processing to extract a bit of the essence of each job title from the row.

library(tidytext)
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
# We begin by adding some common stop words that do not have any value from our dataset into an extended stop word dictionary.

stop_words %>% 
  add_row(word= as.character(1:5), lexicon = "group_5") %>%
  add_row(word= c("i","I","ii","II","iii","III"), lexicon = "group_5") -> stop_words

# Then we tokenize the job titles in order to be able to remove the stop words and further process.

text_df <- tibble(line = 1:length(sal_US1$job_title), text = sal_US1$job_title) %>% 
  unnest_tokens(word,text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# We take the lemmatization of each word used in job title to better standardize the meaning behind the job titles 

text_df$word <- lemmatize_words(text_df$word)

# Below we find those words which occur more than 50 times in order to use that as our singular indicator instead of job title. We keep only the lowest count word for each row of the job title column in order to maximize variability for prediction.

text_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n >= 50) -> top_role

text_df %>% 
  inner_join(top_role, join_by(word == word)) %>% 
  arrange(n) %>% 
  distinct(line,.keep_all = TRUE) -> text_df

sal_US1 %>%
  mutate(Rownumber = row_number()) %>% 
  left_join(text_df, join_by(Rownumber == line )) %>%
  mutate(job_title = word) %>% 
  mutate(job_title = if_else(is.na(job_title), "other", job_title)) %>% 
  select(-Rownumber, -word, -n) -> sal_US1

Timestamp

Timestamp is just the datetime of the response so we will remove it.

We now have 22,799 rows with 21 variables which we have either cleaned or created.

Outliers

To find outliers, we will use the scores() function from the outliers package, with type set to “iqr” and lim (the limit) set to 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.

We will create flags for each variable to identify potential outliers.
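A sketch of the flag creation (assuming scores() returns a logical vector when lim is supplied):

library(outliers)
library(dplyr)

sal_US1 <- sal_US1 %>%
  mutate(annual_salOUT = scores(annual_sal, type = "iqr", lim = 1.5),
         total_compOUT = scores(total_comp, type = "iqr", lim = 1.5))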

Using this method, we created:
- annual_salOUT: which shows we have 891 (4%) potential outliers in annual_sal
- total_compOUT: which shows 1,250 (5%) potential outliers in total_comp

The table below shows the resulting values (TRUE is a potential outlier).

##       annual_salOUT total_compOUT
## FALSE         21998         21639
## TRUE            891          1250

Correlation

Since most of the data is categorical, we can use a correlation funnel to look for correlation in our data.

We also take a look at what is correlated with high compensation, to help see what feeds into our other response variable.
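A sketch of the funnel for the classification target, assuming the correlationfunnel package and that hi_comp is a factor so binarize() produces a hi_comp__1 column:

library(dplyr)
library(correlationfunnel)

sal_US1 %>%
  select(-total_comp) %>%                      # focus on the classification target here
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%
  correlate(target = hi_comp__1) %>%
  plot_correlation_funnel()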

Looking at the correlation for job titles, we can see that some job titles are slightly correlated with total compensation and high compensation; however, the vast majority of job titles have little to no correlation with either. Thus, we segment our job titles into those with the five highest magnitudes of correlation and an “other” class. This will lessen the compute requirements of any models that we build while not losing much salient information.

## [1] "engineer"       "other"          "librarian"      "director"      
## [5] "software"       "administrative"

Build Models

With our data processed and explored, we are ready to begin the exercise of building our models. We will have two distinct classes of models: regression models that attempt to predict total compensation, and classification models that attempt to predict whether a given worker's compensation is above or below the per capita personal income for their state.

We’ll begin by splitting our processed data into train and test sets. While doing this split, we want to take into consideration that lower-compensation responses are less represented in our dataset. If we just leave the data as is and split based on the existing class distribution, our models will be more likely to predict the higher compensation class simply because it is more represented. Right now the data is only marginally imbalanced according to Foundations of Imbalanced Learning, so we will split the data without rebalancing the classes for now. However, if we find later that there is a large bias towards one class or the other in classification, we will undersample the rows that are above the state per capita income.

Before splitting our data we also take the opportunity to relevel some of our categorical variables. We do this in order to create a logistic regression model equation that has a more sensible baseline, rather than using an arbitrary factor level as the base. For example, for gender and education we set the baseline to the non-answers, so we can read the coefficients of man, woman, and non-binary to intuit how important they are towards having a higher income. We also set the reference level of compensation to the lower-compensation class, so that the coefficients inform us on what leads to having a higher income rather than the other way around.
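A sketch of the releveling (assuming gender, edu and hi_comp can be coerced to unordered factors with the levels shown earlier):

# Make the non-answer / lower-compensation level the reference so other coefficients read against it
sal_US1$gender  <- relevel(factor(sal_US1$gender), ref = "Prefer not to answer")
sal_US1$edu     <- relevel(factor(sal_US1$edu, ordered = FALSE), ref = "Prefer not to answer")
sal_US1$hi_comp <- relevel(factor(sal_US1$hi_comp), ref = "0")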

## [1] 15961    22

We utilize a 70% training to 30% test data split for our dataset to get a good portion of data into each set.
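A sketch of the split using caret's createDataPartition(), which stratifies on the outcome (the seed and the use of caret are assumptions):

library(caret)

set.seed(123)  # hypothetical seed for reproducibility
train_idx <- createDataPartition(sal_US1$hi_comp, p = 0.7, list = FALSE)
sal_train <- sal_US1[train_idx, ]
sal_test  <- sal_US1[-train_idx, ]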

Regression Models

Our first order of model building business is to attempt to predict total compensation.

Backwards Stepwise Parameter Selection

First we run a multiple linear regression model using backwards step-wise elimination of variables. Initially we have to eliminate hi_comp, since it is derived from our target variable and so shouldn’t be used as a predictor. Additionally, we have to eliminate annual_sal since it, along with the (now removed) additional compensation amount, is a component of total_comp.
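A sketch of the elimination using base R's step() (the exact starting formula is an approximation of our full predictor set):

# Full model without hi_comp and annual_sal, then backward elimination by AIC
full_lm <- lm(total_comp ~ . - hi_comp - annual_sal, data = sal_train)
step_lm <- step(full_lm, direction = "backward", trace = 0)
summary(step_lm)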

## 
## Call:
## lm(formula = total_comp ~ state + age + job_title + yrs_wrk_exp + 
##     yrs_in_field_exp + edu + gender + indust_grp + title_context_flg + 
##     income_context_flg + White + HispLatSp + add_comp_flg, data = sal_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188687  -33268   -8509   17990 4823627 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                              116350.1    12385.4   9.394
## stateAlabama                             -46191.7    10721.7  -4.308
## stateAlaska                              -32207.8    12626.3  -2.551
## stateArizona                             -34621.8     7064.9  -4.901
## stateArkansas                            -30326.1    13129.4  -2.310
## stateCalifornia                            9415.2     3946.3   2.386
## stateColorado                            -22690.6     5326.1  -4.260
## stateConnecticut                         -19749.9     7676.5  -2.573
## stateDelaware                            -26211.1    14379.3  -1.823
## stateFlorida                             -35012.7     5619.6  -6.230
## stateGeorgia                             -26428.9     5556.5  -4.756
## stateHawaii                              -22777.6    19554.0  -1.165
## stateIdaho                               -35784.6    10695.9  -3.346
## stateIllinois                            -21347.5     4482.0  -4.763
## stateIndiana                             -41253.4     6484.4  -6.362
## stateIowa                                -42074.7     8405.3  -5.006
## stateKansas                              -43556.8     8777.5  -4.962
## stateKentucky                            -49452.4     7961.8  -6.211
## stateLouisiana                           -43420.3     9335.9  -4.651
## stateMaine                               -42742.7    10049.7  -4.253
## stateMaryland                            -26768.5     5486.5  -4.879
## stateMassachusetts                       -13989.3     4308.5  -3.247
## stateMichigan                            -36103.8     5591.6  -6.457
## stateMinnesota                           -34132.6     5109.0  -6.681
## stateMississippi                         -67885.7    16345.7  -4.153
## stateMissouri                            -37153.4     6516.6  -5.701
## stateMontana                             -40209.8    13407.7  -2.999
## stateNebraska                            -53574.8    10635.2  -5.038
## stateNevada                              -35826.4    11274.6  -3.178
## stateNew Hampshire                       -40674.3    10267.3  -3.962
## stateNew Jersey                             577.7     6316.5   0.091
## stateNew Mexico                          -38010.1    10827.6  -3.510
## stateNew York                             -4745.1     4017.8  -1.181
## stateNorth Carolina                      -26644.0     5423.1  -4.913
## stateNorth Dakota                        -42067.4    16647.2  -2.527
## stateOhio                                -37674.5     5313.1  -7.091
## stateOklahoma                            -44123.3     9830.5  -4.488
## stateOregon                              -26912.3     5353.7  -5.027
## statePennsylvania                        -32983.7     4756.4  -6.935
## stateRhode Island                        -35725.9    12408.2  -2.879
## stateSouth Carolina                      -43566.7     8961.7  -4.861
## stateSouth Dakota                        -63789.2    20055.7  -3.181
## stateTennessee                           -43823.1     7089.3  -6.182
## stateTexas                               -24318.8     4474.8  -5.435
## stateUtah                                -27262.2     7750.0  -3.518
## stateVermont                             -34310.5    13122.1  -2.615
## stateVirginia                            -20602.7     4966.9  -4.148
## stateWashington                            3137.7     4507.0   0.696
## stateWest Virginia                       -59143.4    16629.0  -3.557
## stateWisconsin                           -35489.7     5878.9  -6.037
## stateWyoming                             -50980.3    20033.9  -2.545
## age.L                                    -11565.9    17424.3  -0.664
## age.Q                                     -5957.7    16487.9  -0.361
## age.C                                     -4083.0    12381.5  -0.330
## age^4                                       441.6     7770.8   0.057
## age^5                                     -9756.3     4163.5  -2.343
## age^6                                     -3241.0     2030.9  -1.596
## job_titleengineer                         43063.9     4567.5   9.428
## job_titlelibrarian                       -22437.0     6193.1  -3.623
## job_titledirector                         24446.3     4724.5   5.174
## job_titlesoftware                         56088.9     4195.9  13.367
## job_titleadministrative                  -27670.9     6539.7  -4.231
## yrs_wrk_exp.L                              1440.9     9135.5   0.158
## yrs_wrk_exp.Q                            -14569.3     7591.1  -1.919
## yrs_wrk_exp.C                            -10332.9     5871.5  -1.760
## yrs_wrk_exp^4                             -2393.8     4445.1  -0.539
## yrs_wrk_exp^5                             -2252.4     3281.6  -0.686
## yrs_wrk_exp^6                             -1546.7     2362.0  -0.655
## yrs_wrk_exp^7                             -2018.7     1725.2  -1.170
## yrs_in_field_exp.L                        50243.4    11786.9   4.263
## yrs_in_field_exp.Q                       -23104.2    11005.2  -2.099
## yrs_in_field_exp.C                       -22175.8     8873.7  -2.499
## yrs_in_field_exp^4                        -6184.8     6604.2  -0.936
## yrs_in_field_exp^5                         6724.7     4577.2   1.469
## yrs_in_field_exp^6                         9324.2     3031.1   3.076
## yrs_in_field_exp^7                         3890.1     1962.6   1.982
## eduCollege degree                          -695.7     9771.5  -0.071
## eduMaster's degree                         6609.8     9802.3   0.674
## eduProfessional degree (MD, JD, etc.)     60453.2    10240.7   5.903
## eduHigh School                           -17291.1    11025.5  -1.568
## eduSome college                          -13976.5    10071.3  -1.388
## eduPhD                                    25882.2    10223.9   2.532
## genderMan                                 24308.0     5867.9   4.143
## genderWoman                               -5860.7     5696.4  -1.029
## genderNon-binary                         -14083.9     7060.6  -1.995
## indust_grpInformation Services            -1933.7     1706.5  -1.133
## indust_grpManufacturing and Construction  -6627.2     2758.1  -2.403
## indust_grpHuman Services                 -23892.5     2085.9 -11.454
## indust_grpRaw Materials                    1404.5     8022.4   0.175
## title_context_flg1                       -11453.1     1575.3  -7.270
## income_context_flg1                       23972.7     2221.6  10.790
## White                                    -13695.7     2214.4  -6.185
## HispLatSp                                -15163.3     3570.2  -4.247
## add_comp_flg1                             18821.3     1594.5  11.804
##                                                      Pr(>|t|)    
## (Intercept)                              < 0.0000000000000002 ***
## stateAlabama                                0.000016554404004 ***
## stateAlaska                                          0.010755 *  
## stateArizona                                0.000000965185625 ***
## stateArkansas                                        0.020912 *  
## stateCalifornia                                      0.017052 *  
## stateColorado                               0.000020535354194 ***
## stateConnecticut                                     0.010098 *  
## stateDelaware                                        0.068347 .  
## stateFlorida                                0.000000000476917 ***
## stateGeorgia                                0.000001988231822 ***
## stateHawaii                                          0.244095    
## stateIdaho                                           0.000823 ***
## stateIllinois                               0.000001925111714 ***
## stateIndiana                                0.000000000204689 ***
## stateIowa                                   0.000000562413279 ***
## stateKansas                                 0.000000703782672 ***
## stateKentucky                               0.000000000538951 ***
## stateLouisiana                              0.000003331677394 ***
## stateMaine                                  0.000021199687617 ***
## stateMaryland                               0.000001076896928 ***
## stateMassachusetts                                   0.001169 ** 
## stateMichigan                               0.000000000110014 ***
## stateMinnesota                              0.000000000024543 ***
## stateMississippi                            0.000032968932877 ***
## stateMissouri                               0.000000012096525 ***
## stateMontana                                         0.002713 ** 
## stateNebraska                               0.000000476800898 ***
## stateNevada                                          0.001488 ** 
## stateNew Hampshire                          0.000074787655262 ***
## stateNew Jersey                                      0.927125    
## stateNew Mexico                                      0.000449 ***
## stateNew York                                        0.237616    
## stateNorth Carolina                         0.000000905669411 ***
## stateNorth Dakota                                    0.011514 *  
## stateOhio                                   0.000000000001390 ***
## stateOklahoma                               0.000007226207223 ***
## stateOregon                                 0.000000504116381 ***
## statePennsylvania                           0.000000000004231 ***
## stateRhode Island                                    0.003992 ** 
## stateSouth Carolina                         0.000001176552111 ***
## stateSouth Dakota                                    0.001472 ** 
## stateTennessee                              0.000000000650021 ***
## stateTexas                                  0.000000055737293 ***
## stateUtah                                            0.000437 ***
## stateVermont                                         0.008939 ** 
## stateVirginia                               0.000033714396820 ***
## stateWashington                                      0.486322    
## stateWest Virginia                                   0.000377 ***
## stateWisconsin                              0.000000001606762 ***
## stateWyoming                                         0.010946 *  
## age.L                                                0.506841    
## age.Q                                                0.717849    
## age.C                                                0.741581    
## age^4                                                0.954679    
## age^5                                                0.019127 *  
## age^6                                                0.110552    
## job_titleengineer                        < 0.0000000000000002 ***
## job_titlelibrarian                                   0.000292 ***
## job_titledirector                           0.000000231393040 ***
## job_titlesoftware                        < 0.0000000000000002 ***
## job_titleadministrative                     0.000023370697535 ***
## yrs_wrk_exp.L                                        0.874676    
## yrs_wrk_exp.Q                                        0.054967 .  
## yrs_wrk_exp.C                                        0.078453 .  
## yrs_wrk_exp^4                                        0.590226    
## yrs_wrk_exp^5                                        0.492491    
## yrs_wrk_exp^6                                        0.512593    
## yrs_wrk_exp^7                                        0.241971    
## yrs_in_field_exp.L                          0.000020318201585 ***
## yrs_in_field_exp.Q                                   0.035798 *  
## yrs_in_field_exp.C                                   0.012463 *  
## yrs_in_field_exp^4                                   0.349033    
## yrs_in_field_exp^5                                   0.141811    
## yrs_in_field_exp^6                                   0.002101 ** 
## yrs_in_field_exp^7                                   0.047480 *  
## eduCollege degree                                    0.943245    
## eduMaster's degree                                   0.500123    
## eduProfessional degree (MD, JD, etc.)       0.000000003637373 ***
## eduHigh School                                       0.116833    
## eduSome college                                      0.165231    
## eduPhD                                               0.011366 *  
## genderMan                                   0.000034526952172 ***
## genderWoman                                          0.303569    
## genderNon-binary                                     0.046092 *  
## indust_grpInformation Services                       0.257177    
## indust_grpManufacturing and Construction             0.016281 *  
## indust_grpHuman Services                 < 0.0000000000000002 ***
## indust_grpRaw Materials                              0.861027    
## title_context_flg1                          0.000000000000375 ***
## income_context_flg1                      < 0.0000000000000002 ***
## White                                       0.000000000637114 ***
## HispLatSp                                   0.000021776361416 ***
## add_comp_flg1                            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85990 on 15867 degrees of freedom
## Multiple R-squared:  0.1848, Adjusted R-squared:   0.18 
## F-statistic: 38.67 on 93 and 15867 DF,  p-value: < 0.00000000000000022

Model Results

Most saliently, the Multiple R-squared is 18.48% and the Adjusted R-squared of the model is 18.00%. That means only 18.48% of the variability in total_comp is explained by the predictor variables, and only 18.00% after a penalty for the number of predictors included.

Also, the Residual standard error is 85,990 on 15,867 degrees of freedom. Lower values indicate a better fit, but since our data isn’t scaled, this is mainly useful for comparing our models against each other.

Looking at the coefficients and p-values of the predictor variables there’s a lot of information about the relationships in the model.

California, New Jersey and Washington were the only states with a positive coefficient, so we can infer that, all else being equal, people in those states earn more on average than their counterparts in other states.

Out of age, years of work experience, and years in field experience, three variables that would seem to be correlated, the first two have relatively higher p-values. So it may be that we only need years in field experience, and dropping the other two could improve our model. For example, years in field experience may be a proxy for age, so we would only need one of them. It may also be that years of work experience is less relevant than years in field experience because people who job hop between industries may be either unpredictably high earners or workers with little direction.

Streamlined Model

For our second model we’re going to remove age and yrs_wrk_exp, under the assumption that both are correlated with yrs_in_field_exp and so are not needed in the model.

## 
## Call:
## lm(formula = total_comp ~ state + job_title + yrs_in_field_exp + 
##     edu + gender + indust_grp + title_context_flg + income_context_flg + 
##     White + HispLatSp + add_comp_flg, data = sal_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -186213  -33532   -8571   17925 4827956 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                              118945.2    11852.9  10.035
## stateAlabama                             -46915.3    10720.3  -4.376
## stateAlaska                              -31954.4    12630.3  -2.530
## stateArizona                             -34457.6     7061.9  -4.879
## stateArkansas                            -30144.5    13134.4  -2.295
## stateCalifornia                            9677.2     3942.6   2.455
## stateColorado                            -22223.5     5325.8  -4.173
## stateConnecticut                         -18770.4     7676.8  -2.445
## stateDelaware                            -27056.4    14385.6  -1.881
## stateFlorida                             -34742.9     5618.3  -6.184
## stateGeorgia                             -26383.0     5555.7  -4.749
## stateHawaii                              -24237.6    19563.9  -1.239
## stateIdaho                               -35698.5    10698.5  -3.337
## stateIllinois                            -21383.8     4481.3  -4.772
## stateIndiana                             -41219.9     6484.2  -6.357
## stateIowa                                -41718.8     8410.3  -4.960
## stateKansas                              -43909.0     8779.2  -5.002
## stateKentucky                            -49249.3     7963.6  -6.184
## stateLouisiana                           -42794.1     9338.6  -4.583
## stateMaine                               -42098.8    10048.3  -4.190
## stateMaryland                            -26378.2     5486.1  -4.808
## stateMassachusetts                       -13819.8     4310.2  -3.206
## stateMichigan                            -35779.6     5592.1  -6.398
## stateMinnesota                           -33708.2     5108.6  -6.598
## stateMississippi                         -67256.4    16355.5  -4.112
## stateMissouri                            -37450.7     6517.1  -5.747
## stateMontana                             -39089.5    13411.4  -2.915
## stateNebraska                            -53347.9    10637.4  -5.015
## stateNevada                              -35465.5    11281.6  -3.144
## stateNew Hampshire                       -39983.9    10270.3  -3.893
## stateNew Jersey                             293.8     6314.9   0.047
## stateNew Mexico                          -37900.3    10830.8  -3.499
## stateNew York                             -4423.5     4019.1  -1.101
## stateNorth Carolina                      -27093.2     5420.0  -4.999
## stateNorth Dakota                        -40527.4    16653.8  -2.434
## stateOhio                                -37428.4     5313.5  -7.044
## stateOklahoma                            -45050.4     9835.9  -4.580
## stateOregon                              -26494.0     5344.3  -4.957
## statePennsylvania                        -32636.9     4756.0  -6.862
## stateRhode Island                        -34875.5    12412.2  -2.810
## stateSouth Carolina                      -42672.9     8963.9  -4.761
## stateSouth Dakota                        -64295.7    20065.8  -3.204
## stateTennessee                           -43629.4     7091.0  -6.153
## stateTexas                               -24197.1     4472.6  -5.410
## stateUtah                                -27321.7     7752.7  -3.524
## stateVermont                             -33333.7    13124.9  -2.540
## stateVirginia                            -20687.9     4966.0  -4.166
## stateWashington                            3304.7     4500.8   0.734
## stateWest Virginia                       -59244.0    16629.6  -3.563
## stateWisconsin                           -35421.5     5879.7  -6.024
## stateWyoming                             -52924.2    20042.4  -2.641
## job_titleengineer                         43108.7     4566.5   9.440
## job_titlelibrarian                       -23539.7     6195.2  -3.800
## job_titledirector                         24870.6     4726.1   5.262
## job_titlesoftware                         55912.3     4195.9  13.325
## job_titleadministrative                  -27659.4     6538.4  -4.230
## yrs_in_field_exp.L                        40123.5     9957.1   4.030
## yrs_in_field_exp.Q                       -39622.5     9722.6  -4.075
## yrs_in_field_exp.C                       -26651.6     8019.9  -3.323
## yrs_in_field_exp^4                        -7522.3     5985.6  -1.257
## yrs_in_field_exp^5                         3898.0     4165.2   0.936
## yrs_in_field_exp^6                         6429.1     2765.1   2.325
## yrs_in_field_exp^7                         2045.2     1827.5   1.119
## eduCollege degree                          -907.1     9771.3  -0.093
## eduMaster's degree                         6950.0     9800.4   0.709
## eduProfessional degree (MD, JD, etc.)     60440.9    10232.9   5.907
## eduHigh School                           -17696.8    11027.7  -1.605
## eduSome college                          -14112.0    10068.6  -1.402
## eduPhD                                    25224.6    10210.0   2.471
## genderMan                                 24762.9     5871.0   4.218
## genderWoman                               -5507.1     5700.0  -0.966
## genderNon-binary                         -13653.0     7056.3  -1.935
## indust_grpInformation Services            -1964.9     1706.0  -1.152
## indust_grpManufacturing and Construction  -7060.1     2757.1  -2.561
## indust_grpHuman Services                 -23972.6     2087.3 -11.485
## indust_grpRaw Materials                    1610.3     8023.6   0.201
## title_context_flg1                       -11519.9     1575.9  -7.310
## income_context_flg1                       23982.5     2223.3  10.787
## White                                    -13382.6     2209.9  -6.056
## HispLatSp                                -15203.1     3571.9  -4.256
## add_comp_flg1                             18844.9     1595.7  11.810
##                                                      Pr(>|t|)    
## (Intercept)                              < 0.0000000000000002 ***
## stateAlabama                                 0.00001214806501 ***
## stateAlaska                                          0.011416 *  
## stateArizona                                 0.00000107458617 ***
## stateArkansas                                        0.021742 *  
## stateCalifornia                                      0.014116 *  
## stateColorado                                0.00003025059059 ***
## stateConnecticut                                     0.014492 *  
## stateDelaware                                        0.060017 .  
## stateFlorida                                 0.00000000064074 ***
## stateGeorgia                                 0.00000206401543 ***
## stateHawaii                                          0.215404    
## stateIdaho                                           0.000849 ***
## stateIllinois                                0.00000184277611 ***
## stateIndiana                                 0.00000000021142 ***
## stateIowa                                    0.00000071057188 ***
## stateKansas                                  0.00000057490917 ***
## stateKentucky                                0.00000000063919 ***
## stateLouisiana                               0.00000462943183 ***
## stateMaine                                   0.00002808997158 ***
## stateMaryland                                0.00000153678734 ***
## stateMassachusetts                                   0.001347 ** 
## stateMichigan                                0.00000000016154 ***
## stateMinnesota                               0.00000000004290 ***
## stateMississippi                             0.00003939382472 ***
## stateMissouri                                0.00000000927424 ***
## stateMontana                                         0.003566 ** 
## stateNebraska                                0.00000053574855 ***
## stateNevada                                          0.001672 ** 
## stateNew Hampshire                           0.00009936058357 ***
## stateNew Jersey                                      0.962897    
## stateNew Mexico                                      0.000468 ***
## stateNew York                                        0.271074    
## stateNorth Carolina                          0.00000058316918 ***
## stateNorth Dakota                                    0.014964 *  
## stateOhio                                    0.00000000000194 ***
## stateOklahoma                                0.00000468046907 ***
## stateOregon                                  0.00000072169008 ***
## statePennsylvania                            0.00000000000703 ***
## stateRhode Island                                    0.004964 ** 
## stateSouth Carolina                          0.00000194769550 ***
## stateSouth Dakota                                    0.001357 ** 
## stateTennessee                               0.00000000077973 ***
## stateTexas                                   0.00000006391190 ***
## stateUtah                                            0.000426 ***
## stateVermont                                         0.011103 *  
## stateVirginia                                0.00003117500990 ***
## stateWashington                                      0.462813    
## stateWest Virginia                                   0.000368 ***
## stateWisconsin                               0.00000000173531 ***
## stateWyoming                                         0.008284 ** 
## job_titleengineer                        < 0.0000000000000002 ***
## job_titlelibrarian                                   0.000145 ***
## job_titledirector                            0.00000014406359 ***
## job_titlesoftware                        < 0.0000000000000002 ***
## job_titleadministrative                      0.00002346819283 ***
## yrs_in_field_exp.L                           0.00005612520878 ***
## yrs_in_field_exp.Q                           0.00004618263793 ***
## yrs_in_field_exp.C                                   0.000892 ***
## yrs_in_field_exp^4                                   0.208871    
## yrs_in_field_exp^5                                   0.349359    
## yrs_in_field_exp^6                                   0.020081 *  
## yrs_in_field_exp^7                                   0.263105    
## eduCollege degree                                    0.926033    
## eduMaster's degree                                   0.478238    
## eduProfessional degree (MD, JD, etc.)        0.00000000356566 ***
## eduHigh School                                       0.108568    
## eduSome college                                      0.161058    
## eduPhD                                               0.013500 *  
## genderMan                                    0.00002480064213 ***
## genderWoman                                          0.333981    
## genderNon-binary                                     0.053025 .  
## indust_grpInformation Services                       0.249423    
## indust_grpManufacturing and Construction             0.010455 *  
## indust_grpHuman Services                 < 0.0000000000000002 ***
## indust_grpRaw Materials                              0.840937    
## title_context_flg1                           0.00000000000028 ***
## income_context_flg1                      < 0.0000000000000002 ***
## White                                        0.00000000142852 ***
## HispLatSp                                    0.00002089941629 ***
## add_comp_flg1                            < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 86070 on 15880 degrees of freedom
## Multiple R-squared:  0.1825, Adjusted R-squared:  0.1784 
## F-statistic: 44.31 on 80 and 15880 DF,  p-value: < 0.00000000000000022

Model Results

Most saliently, our Multiple R-squared dropped from 18.48% to 18.25% and our Adjusted R-squared from 18% to 17.84%.

The residual standard error also increased from 85,990 to 86,070, while the degrees of freedom increased from 15,867 to 15,880.

It’s unlikely that any further reduction in the predictor set will produce a better model, so we would have to look elsewhere for improvement, such as better handling of outliers or taking the natural log of total_comp.
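As a rough sketch of that last idea (the object name model_df and the exact predictor list are assumptions taken from the summary above, not the code actually used in this report), the response could be log-transformed and the regression refit:

# Hypothetical sketch: refit on the natural log of total compensation,
# keeping the same predictors that appear in the summary above.
# Assumes total_comp > 0; otherwise use log1p() or drop zero values first.
lm_log <- lm(log(total_comp) ~ state + job_title + yrs_in_field_exp + edu +
               gender + indust_grp + title_context_flg + income_context_flg +
               White + HispLatSp + add_comp_flg,
             data = model_df)
summary(lm_log)  # compare R-squared and residual spread on the log scale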

There are some more interesting takeaways in the coefficients. The coefficient for a college degree is almost zero, which could be read as support for the idea that a college degree is now assumed in the workplace and is necessary just to stay average.

There is a large coefficient for being a man but a very high p-value for being a woman, which I take to mean that while there is a trend for women to make less than men, women also show a lot more variability in their salaries than men do. It would be interesting to look for similarities among high-earning women as opposed to high-earning men.


Classification Models

Next we want to attempt to predict if a new person with given features is earning above or below the median compensation for their state.

Unbalanced - Simple Logistic Regression

We begin by building a simple logistic regression model through the glm function, keeping all of our features except the ones that directly determine the response variable. If we left variables like total compensation in, we could get close to 100% accuracy, but only because the compensation classification was derived directly from them.

The logit model built through glm relaxes the distributional assumptions on our explanatory variables: they are not required to be normal and may follow a variety of distributions. Our link function is the logit link, which lets us model the two classes of our response and gives each explanatory variable a coefficient that shifts the probability of a prediction falling into either class.

When fitting this model, and all of our other models as well, we use a five-fold cross-validation approach with ROC as the metric for each validation set. The reason we use cross-validation is to reduce overfitting and maximize the model’s generalizability to new data. ROC is used as the metric because it is especially important when the data are imbalanced; if we used accuracy, the minority class could easily be neglected by the model we build.

The data for each model are also preprocessed by centering and scaling each column. This means models whose fitting depends on distances between points or on the magnitude of values can be used as well.
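A minimal sketch of that fitting setup with caret is shown below; train_df, the outcome column comp_class, and the abbreviated formula are placeholder names rather than the exact objects used in this report.

library(caret)

# Five-fold cross-validation scored on ROC; class probabilities and
# twoClassSummary are needed to report ROC, sensitivity, and specificity.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Logistic regression (glm) with every predictor centered and scaled first.
glm_fit <- train(comp_class ~ ., data = train_df,
                 method = "glm", metric = "ROC",
                 preProcess = c("center", "scale"),
                 trControl = ctrl)
glm_fit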

## Generalized Linear Model 
## 
## 15961 samples
##    21 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (98), scaled (98) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12769, 12768, 12769, 12769, 12769 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7783245  0.5297297  0.8476247

Our first model shows promising results, with a training ROC close to 0.8. Yet we have a sensitivity barely above 0.5, which means that when it comes to identifying the minority class of low income it is about as good as flipping a coin. We definitely want to improve on this as we refine the model, since largely neglecting one class does not make for a good model.

Next, we check some assumptions and qualities for our model that further assess its worth in modeling.

If we look at a standardized residual graph for our model by index and class, we can see that there are no particular outliers that require attention; everything falls within 3 standard deviations. However, our low income class does seem to have more variation than our high income class, which is a cause for slight concern.

We do not show the value of each coefficient in the equation because, with each individual state included, there are almost 100 terms. Instead, we show which variables contribute most to the classification of being below or above the state median income.
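One way to pull that ranking from a caret model object is varImp; a small sketch, reusing the placeholder glm_fit from the earlier setup:

# Rank terms by absolute importance and plot the top 20 instead of
# printing the ~100 individual coefficients.
plot(varImp(glm_fit), top = 20)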

We see three variables that stand head and shoulders above the rest. If an employee works in the human services industry, they are much more likely to have a below-median income for their state. This makes sense, as it implies positions such as a cashier at a grocery store that are typically not highly paid. If the employee receives any additional compensation at all, or their job involves software, then it is perhaps no surprise that they are more likely to land above the state median income.

Finally, many states do make it into the top 20 important variables, yet with 50 different levels coming from the state category, it does make one wonder if we can simplify this model by excluding it and still get relatively close performance.

If we check assumptions further by taking a look at our residual graphs, we can see from the residuals vs fitted graph that the residuals for the two classes separate into two distinct lines, a good sign that our model is achieving separability between the classes fairly well. Taking a look at the Q-Q residual plot, we see that the distribution follows a theoretical normal distribution fairly well, an indication that our model is not facing any issues with completely disparate distributions. Looking at the residuals vs leverage plot, we can reconfirm that we don’t have any outliers that negatively affect our model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1371  660
##       high 1165 3642
##                                                
##                Accuracy : 0.7331               
##                  95% CI : (0.7225, 0.7436)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4037               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.5406               
##             Specificity : 0.8466               
##          Pos Pred Value : 0.6750               
##          Neg Pred Value : 0.7576               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2005               
##    Detection Prevalence : 0.2970               
##       Balanced Accuracy : 0.6936               
##                                                
##        'Positive' Class : low                  
## 

Now if we take a look at our predictions against the test set, we see an overall accuracy of 73.3%, which is a decent amount lower than our training ROC would suggest. This would be a reasonably accurate model to use if not for the sensitivity of 54%, which indicates that people who are actually low income are classified correctly with roughly the odds of a coin flip. Since we don’t have a particular inclination towards type 1 or type 2 errors, we want to balance sensitivity and specificity as much as possible, which is why we will use the balanced accuracy measure going forward to compare our different models.
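The test-set statistics above can be reproduced along these lines (test_df, comp_class, and glm_fit are placeholder names from the earlier sketches); marking "low" as the positive class is what makes sensitivity refer to the low income class.

# Predict classes on the held-out test set and summarize, treating the
# minority class "low" as the positive class.
test_pred <- predict(glm_fit, newdata = test_df)
confusionMatrix(data = test_pred,
                reference = test_df$comp_class,
                positive = "low")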

Unbalanced - Simple Logistic Regression 2

Being able to glimpse the variable importance from our first simple logistic model lets us refine the variables we are using even further. We want to remove as many variables as possible to make the model more interpretable, so we take only six categories from our training dataset and explicitly drop the state category despite it being an important variable. The reason is that dropping it substantially cuts down on the terms in the resulting equation: as seen below, we go from 98 terms requiring preprocessing to 26, leaving roughly a quarter of the original terms and making the equation substantially smaller.

## Generalized Linear Model 
## 
## 15961 samples
##     6 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12769, 12768, 12769, 12769, 12769 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7626758  0.5077703  0.8388611

We do lose some ROC and sensitivity in the training evaluation, but a drop of roughly 0.02 in ROC is very promising for removing such an important variable. Still, with a sensitivity of 0.508 we already know that this model will not work out well if we want to maintain a good balanced accuracy.

Next, we check some assumptions and qualities for our model that further assess its worth in modeling.

If we look at a standardized residual graph for our model by index and class, we can see that there are no particular outliers that require attention; everything falls within 3 standard deviations. In fact, compared to our previous model we have a tighter spread, with fewer values going beyond 2 standard deviations. However, once again our low income class does seem to have more variation than our high income class, which is a cause for slight concern. We will balance our training dataset for the next model, as the problem seems to have gotten worse in this model.

Once again, the three variables of working in human services, receiving additional compensation, and having software in your job title sit at the top. However, removing the state variable allows more variables to show up within the top 20 of importance. In combination with the coefficient summary below, we see that those working in an administrative or librarian position (sorry Diana) have a lower chance of being amongst those with an above-median income compared to the baseline of job titles that were uncommon in our dataset and set to other. Sadly, gender makes an appearance in the important variables, with men having a higher chance of being above median income compared to the baseline of those who did not identify themselves, while women and non-binary folks are even less likely to be classified as high income against that baseline.

Unsurprisingly, we also get years of experience in the field as an important contributor to whether you are high income or not. It should be noted that since we input this variable as an ordered factor, the coefficients are provided in terms of a 7th degree polynomial, which does make interpretation a bit confusing, although only the first two terms are significant.
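Those .L, .Q, .C, ^4 ... ^7 terms come from R’s default polynomial contrasts for ordered factors; a small self-contained illustration with a made-up eight-level ordered factor (not the survey data):

# An ordered factor with 8 levels expands into 7 orthogonal polynomial
# contrasts: linear, quadratic, cubic, then degrees 4 through 7.
x <- factor(rep(1:8, times = 3), levels = 1:8, ordered = TRUE)
contrasts(x)             # contr.poly columns: .L .Q .C ^4 ^5 ^6 ^7
head(model.matrix(~ x))  # the same columns a model formula would generate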

## 
## Call:
## NULL
## 
## Coefficients:
##                                             Estimate Std. Error z value
## (Intercept)                                 0.699459   0.020288  34.476
## job_titleengineer                           0.286224   0.031609   9.055
## job_titlelibrarian                         -0.160808   0.018302  -8.786
## job_titledirector                           0.119522   0.021730   5.500
## job_titlesoftware                           0.525901   0.042933  12.249
## job_titleadministrative                    -0.212920   0.023393  -9.102
## yrs_in_field_exp.L                          0.425766   0.063570   6.698
## yrs_in_field_exp.Q                         -0.224794   0.068545  -3.279
## yrs_in_field_exp.C                         -0.087605   0.070024  -1.251
## `yrs_in_field_exp^4`                        0.023325   0.061486   0.379
## `yrs_in_field_exp^5`                        0.018400   0.045044   0.408
## `yrs_in_field_exp^6`                        0.005698   0.031993   0.178
## `yrs_in_field_exp^7`                        0.017922   0.022824   0.785
## `eduCollege degree`                        -0.088829   0.131154  -0.677
## `eduMaster's degree`                        0.079621   0.123047   0.647
## `eduProfessional degree (MD, JD, etc.)`     0.304390   0.060512   5.030
## `eduHigh School`                           -0.146543   0.038443  -3.812
## `eduSome college`                          -0.270484   0.068947  -3.923
## eduPhD                                      0.199814   0.060013   3.329
## genderMan                                   0.150560   0.063018   2.389
## genderWoman                                -0.090325   0.065500  -1.379
## `genderNon-binary`                         -0.153483   0.031089  -4.937
## `indust_grpInformation Services`           -0.070215   0.022029  -3.187
## `indust_grpManufacturing and Construction`  0.135224   0.021846   6.190
## `indust_grpHuman Services`                 -0.326297   0.020652 -15.800
## `indust_grpRaw Materials`                  -0.030619   0.017515  -1.748
## add_comp_flg1                               0.231334   0.018075  12.799
##                                                        Pr(>|z|)    
## (Intercept)                                < 0.0000000000000002 ***
## job_titleengineer                          < 0.0000000000000002 ***
## job_titlelibrarian                         < 0.0000000000000002 ***
## job_titledirector                               0.0000000379040 ***
## job_titlesoftware                          < 0.0000000000000002 ***
## job_titleadministrative                    < 0.0000000000000002 ***
## yrs_in_field_exp.L                              0.0000000000212 ***
## yrs_in_field_exp.Q                                     0.001040 ** 
## yrs_in_field_exp.C                                     0.210906    
## `yrs_in_field_exp^4`                                   0.704427    
## `yrs_in_field_exp^5`                                   0.682910    
## `yrs_in_field_exp^6`                                   0.858634    
## `yrs_in_field_exp^7`                                   0.432337    
## `eduCollege degree`                                    0.498224    
## `eduMaster's degree`                                   0.517584    
## `eduProfessional degree (MD, JD, etc.)`         0.0000004897689 ***
## `eduHigh School`                                       0.000138 ***
## `eduSome college`                               0.0000874360442 ***
## eduPhD                                                 0.000870 ***
## genderMan                                              0.016887 *  
## genderWoman                                            0.167890    
## `genderNon-binary`                              0.0000007938583 ***
## `indust_grpInformation Services`                       0.001436 ** 
## `indust_grpManufacturing and Construction`      0.0000000006017 ***
## `indust_grpHuman Services`                 < 0.0000000000000002 ***
## `indust_grpRaw Materials`                              0.080431 .  
## add_comp_flg1                              < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21050  on 15960  degrees of freedom
## Residual deviance: 17497  on 15934  degrees of freedom
## AIC: 17551
## 
## Number of Fisher Scoring iterations: 6

If we check assumptions further by taking a look at our residual graphs, we see very similar patterns to the previous model: the residuals vs fitted graph again separates the two classes into two distinct lines, a good sign that the model is achieving separability between the classes fairly well; the Q-Q residual plot follows a theoretical normal distribution fairly well, indicating no issues with completely disparate distributions; and the residuals vs leverage plot reconfirms that we don’t have any outliers that negatively affect the model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1298  679
##       high 1238 3623
##                                                
##                Accuracy : 0.7197               
##                  95% CI : (0.7088, 0.7303)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3708               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.5118               
##             Specificity : 0.8422               
##          Pos Pred Value : 0.6566               
##          Neg Pred Value : 0.7453               
##              Prevalence : 0.3709               
##          Detection Rate : 0.1898               
##    Detection Prevalence : 0.2891               
##       Balanced Accuracy : 0.6770               
##                                                
##        'Positive' Class : low                  
## 

Now if we take a look at our predictions against the test set, we see an overall accuracy of 72.0%, a hit of 1.3% from our previous model. That difference would not be a big deal given the gain in interpretability, if not for the sensitivity of 51%, a 3% drop that brings us a flea’s hair away from an actual coin flip on classifying those who are truly low income. Looking at balanced accuracy, we get a drop of 2%, which means that this model in its current form is simply worse than the previous model on our primary metric.

Balanced - Simple Logistic Regression

However, that doesn’t mean we’re giving up on this selection of terms. Our next step is to balance the training dataset in order to improve sensitivity and thus balanced accuracy. With a balanced class distribution, our models are less likely to effectively guess at random for one of the classes.

We balance the training dataset by randomly discarding some of the rows in the higher compensation class, i.e., random under-sampling.

## .
##  low high 
## 5920 5920

We lose about 4,000 rows, but our classes are now balanced. Since we started with such a large number of rows, and our new model formula does not use high-cardinality features like state, we should still have a model that can decently predict classes in the test set and, ideally, the population.
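A minimal sketch of that under-sampling step, assuming the training data sit in a data frame train_df with the outcome column comp_class (the exact helper function used above may differ):

set.seed(42)
n_min <- min(table(train_df$comp_class))  # size of the smaller class

# Keep a random sample of n_min rows from each class so the two classes
# end up the same size.
balanced_train <- train_df %>%
  group_by(comp_class) %>%
  sample_n(n_min) %>%
  ungroup()

table(balanced_train$comp_class)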

We fit the same formula and model from the last iteration to our newly balanced training dataset.

## Generalized Linear Model 
## 
## 11840 samples
##     6 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7607944  0.6920608  0.6827703

We lose a sliver of ROC relative to the last model, but we achieve the desired effect of increasing sensitivity by almost 0.2. This comes at the cost of lower specificity, but since we are aiming for balanced accuracy it is a sacrifice we are willing to make.

Looking at the residuals by index, we see something unexpected: our under-sampling method appears to have left the rows grouped in blocks by outcome. Although in theory cross-validation still draws random training and validation splits, it feels appropriate to randomly reorder the data after balancing to ensure a random mix within the cross-validation folds.

We reshuffle the data with a simple call to the sample_n function to put the rows in random order, and then rerun our model.
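That reshuffle can be as simple as the following, using the placeholder balanced_train from the sketch above:

# Put the rows in a random order so the row index no longer tracks the
# class blocks left over from balancing, then refit the same caret model.
set.seed(42)
balanced_train <- balanced_train %>% sample_n(nrow(balanced_train))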

## Generalized Linear Model 
## 
## 11840 samples
##     6 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results:
## 
##   ROC       Sens       Spec     
##   0.761133  0.6945946  0.6795608

Our ROC and sensitivity increase slightly while specificity goes down. Overall this is a net gain, but the changes are minuscule and just come from different validation sets being used.

We see our index versus standardized residual graph looking much more regular. There are again no anomalies that make us question our model, though it should be noted that with balanced classes the differences in variability between the classes no longer seem apparent. What was seen previously was likely an artifact of the high income class making up a larger share of the data.

With our balanced dataset, we now have a more balanced set of training fit statistics. Yet the important variables are almost identical to those in the previous run of this model. That seems a bit strange at first, but looking carefully at the coefficient summary we see that slight changes to individual coefficients, such as the gender terms, can make all the difference in how rows end up being predicted. Each term still pushes a row toward or away from the high income class, just slightly less strongly, leaving classification as our base class of lower income more likely.

## 
## Call:
## NULL
## 
## Coefficients:
##                                            Estimate Std. Error z value
## (Intercept)                                 0.02287    0.02148   1.065
## job_titleengineer                           0.26456    0.03016   8.771
## job_titlelibrarian                         -0.15197    0.02245  -6.769
## job_titledirector                           0.11843    0.02241   5.284
## job_titlesoftware                           0.45532    0.03823  11.909
## job_titleadministrative                    -0.23786    0.03246  -7.328
## yrs_in_field_exp.L                          0.45359    0.06628   6.843
## yrs_in_field_exp.Q                         -0.20951    0.07358  -2.848
## yrs_in_field_exp.C                         -0.05591    0.07304  -0.766
## `yrs_in_field_exp^4`                        0.04635    0.06563   0.706
## `yrs_in_field_exp^5`                        0.02494    0.04872   0.512
## `yrs_in_field_exp^6`                        0.00942    0.03459   0.272
## `yrs_in_field_exp^7`                        0.03127    0.02434   1.285
## `eduCollege degree`                        -0.11142    0.14666  -0.760
## `eduMaster's degree`                        0.07388    0.13728   0.538
## `eduProfessional degree (MD, JD, etc.)`     0.27318    0.06234   4.382
## `eduHigh School`                           -0.16018    0.04505  -3.555
## `eduSome college`                          -0.29042    0.08046  -3.610
## eduPhD                                      0.17613    0.06375   2.763
## genderMan                                   0.13229    0.06792   1.948
## genderWoman                                -0.09381    0.07191  -1.305
## `genderNon-binary`                         -0.15825    0.03789  -4.177
## `indust_grpInformation Services`           -0.09326    0.02482  -3.757
## `indust_grpManufacturing and Construction`  0.13898    0.02319   5.993
## `indust_grpHuman Services`                 -0.32122    0.02444 -13.145
## `indust_grpRaw Materials`                  -0.03790    0.02038  -1.860
## add_comp_flg1                               0.23630    0.02104  11.231
##                                                        Pr(>|z|)    
## (Intercept)                                            0.286903    
## job_titleengineer                          < 0.0000000000000002 ***
## job_titlelibrarian                            0.000000000012970 ***
## job_titledirector                             0.000000126594756 ***
## job_titlesoftware                          < 0.0000000000000002 ***
## job_titleadministrative                       0.000000000000234 ***
## yrs_in_field_exp.L                            0.000000000007729 ***
## yrs_in_field_exp.Q                                     0.004406 ** 
## yrs_in_field_exp.C                                     0.443968    
## `yrs_in_field_exp^4`                                   0.480004    
## `yrs_in_field_exp^5`                                   0.608689    
## `yrs_in_field_exp^6`                                   0.785361    
## `yrs_in_field_exp^7`                                   0.198802    
## `eduCollege degree`                                    0.447415    
## `eduMaster's degree`                                   0.590471    
## `eduProfessional degree (MD, JD, etc.)`       0.000011766271969 ***
## `eduHigh School`                                       0.000377 ***
## `eduSome college`                                      0.000307 ***
## eduPhD                                                 0.005727 ** 
## genderMan                                              0.051446 .  
## genderWoman                                            0.192011    
## `genderNon-binary`                            0.000029583860942 ***
## `indust_grpInformation Services`                       0.000172 ***
## `indust_grpManufacturing and Construction`    0.000000002058367 ***
## `indust_grpHuman Services`                 < 0.0000000000000002 ***
## `indust_grpRaw Materials`                              0.062889 .  
## add_comp_flg1                              < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16414  on 11839  degrees of freedom
## Residual deviance: 13643  on 11813  degrees of freedom
## AIC: 13697
## 
## Number of Fisher Scoring iterations: 5

If we check assumptions further by taking a look at our residual graphs, we again do not see much change: the residuals vs fitted graph separates the two classes into two distinct lines, a good sign that the model is achieving separability between the classes fairly well; the Q-Q residual plot follows a theoretical normal distribution fairly well, indicating no issues with completely disparate distributions; and the residuals vs leverage plot reconfirms that we don’t have any outliers that negatively affect the model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1776 1362
##       high  760 2940
##                                                
##                Accuracy : 0.6897               
##                  95% CI : (0.6786, 0.7006)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3659               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.7003               
##             Specificity : 0.6834               
##          Pos Pred Value : 0.5660               
##          Neg Pred Value : 0.7946               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2597               
##    Detection Prevalence : 0.4589               
##       Balanced Accuracy : 0.6919               
##                                                
##        'Positive' Class : low                  
## 

Now if we take a look at our predictions against the test set, we see an overall accuracy of 69.0%, a hit of 3% from our previous model. That difference is not a big deal this time because sensitivity is now 70%, a 19% gain, meaning our model is actually identifying those in the low income class rather than just guessing. Our specificity of 68.3% is a 16% drop, but that is simply the price of balancing with the predictors we are using. Looking at balanced accuracy, we get 69.2%, an increase of 1.4%, which means this model is on track to beat our original model while using fewer terms.

Balanced - Elastic Net Logistic Regression

We attempt to improve our model further through regularization, a technique that can optimize generalized linear models by shrinking coefficients and dropping uninformative terms. Here we use the glmnet package for its elastic net logistic regression. Elastic net essentially combines lasso and ridge regression in a single penalized minimization problem governed by a lambda parameter and an alpha parameter: lambda controls the overall strength of the penalty (how strongly coefficients are shrunk), while alpha mixes between the lasso penalty (alpha = 1, which can drop terms entirely) and the ridge penalty (alpha = 0, which only keeps coefficients small in magnitude).

We utilize a grid search over evenly spaced values of alpha and lambda (0 to 1 in steps of 0.2) to fit the best elastic net model to our cross-validated and balanced training dataset. However, we end up with a surprise: the optimal values amount to not using regularization at all. Our literature review did hint that we would not get much out of regularization without a very large number of quantitative variables, and our training ROC ends up essentially matching our generalized linear model, with an almost even sensitivity and specificity.
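A sketch of that grid search using caret's glmnet method, matching the alpha and lambda values in the output below (the formula and object names are placeholders carried over from the earlier sketches):

# alpha mixes ridge (0) and lasso (1); lambda sets the overall penalty
# strength. Search a coarse grid of both, scored on cross-validated ROC.
enet_grid <- expand.grid(alpha  = seq(0, 1, by = 0.2),
                         lambda = seq(0, 1, by = 0.2))

enet_fit <- train(comp_class ~ ., data = balanced_train,
                  method = "glmnet", metric = "ROC",
                  preProcess = c("center", "scale"),
                  tuneGrid = enet_grid, trControl = ctrl)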

## glmnet 
## 
## 11840 samples
##     6 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  ROC        Sens       Spec     
##   0.0    0.0     0.7610787  0.6871622  0.6841216
##   0.0    0.2     0.7566333  0.6944257  0.6701014
##   0.0    0.4     0.7527175  0.6988176  0.6633446
##   0.0    0.6     0.7502282  0.6956081  0.6652027
##   0.0    0.8     0.7483343  0.6883446  0.6663851
##   0.0    1.0     0.7470190  0.6876689  0.6658784
##   0.2    0.0     0.7607919  0.6893581  0.6802365
##   0.2    0.2     0.7228232  0.6582770  0.6650338
##   0.2    0.4     0.6635677  0.6630068  0.6043919
##   0.2    0.6     0.6635663  0.6630068  0.6043919
##   0.2    0.8     0.5000000  0.8000000  0.2000000
##   0.2    1.0     0.5000000  0.8000000  0.2000000
##   0.4    0.0     0.7607952  0.6890203  0.6802365
##   0.4    0.2     0.6635663  0.6630068  0.6043919
##   0.4    0.4     0.5000000  0.8000000  0.2000000
##   0.4    0.6     0.5000000  0.8000000  0.2000000
##   0.4    0.8     0.5000000  0.8000000  0.2000000
##   0.4    1.0     0.5000000  0.8000000  0.2000000
##   0.6    0.0     0.7607762  0.6890203  0.6802365
##   0.6    0.2     0.6635663  0.6630068  0.6043919
##   0.6    0.4     0.5000000  0.8000000  0.2000000
##   0.6    0.6     0.5000000  0.8000000  0.2000000
##   0.6    0.8     0.5000000  0.8000000  0.2000000
##   0.6    1.0     0.5000000  0.8000000  0.2000000
##   0.8    0.0     0.7607942  0.6858108  0.6836149
##   0.8    0.2     0.5000000  0.8000000  0.2000000
##   0.8    0.4     0.5000000  0.8000000  0.2000000
##   0.8    0.6     0.5000000  0.8000000  0.2000000
##   0.8    0.8     0.5000000  0.8000000  0.2000000
##   0.8    1.0     0.5000000  0.8000000  0.2000000
##   1.0    0.0     0.7608033  0.6876689  0.6817568
##   1.0    0.2     0.5000000  0.8000000  0.2000000
##   1.0    0.4     0.5000000  0.8000000  0.2000000
##   1.0    0.6     0.5000000  0.8000000  0.2000000
##   1.0    0.8     0.5000000  0.8000000  0.2000000
##   1.0    1.0     0.5000000  0.8000000  0.2000000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0 and lambda = 0.

If we take a look at the ROC plot across the different parameters, where the mixing percentage is alpha and the regularization strength is lambda, we see that regularizing our model coefficients only makes the ROC worse. This is because it takes away from the few variables we already have, leaving us with less information overall.
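That view of ROC across the tuning grid comes straight from plotting the caret tuning object (enet_fit is the placeholder from the sketch above):

# Profile plot of cross-validated ROC over the alpha/lambda grid.
plot(enet_fit)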

We unfortunately do not have a straightforward way to check the residual-based assumptions for our elastic net model, but since an elastic net model with both hyperparameters set to 0 is essentially the same as a glm logit model, the assumptions we checked previously should suffice.

Looking at variable importance, we see that years in field experience has suddenly rocketed to the top importance spot, and the quadratic and cubic terms for the same variable are now in the top 20. The rest of the important variables seem relatively unchanged. This hints that the glmnet library, which minimizes the elastic net objective, handles ordered factors somewhat better than the glm library when building a model, since an elastic net model with the hyperparameters set to 0 and the generalized logit model should give the same end result if both are computed exactly.

##                                          coef.glm4.finalModel....position.
## (Intercept)                                                    0.015516001
## job_titleengineer                                              0.230349496
## job_titlelibrarian                                            -0.139537857
## job_titledirector                                              0.108951063
## job_titlesoftware                                              0.371545904
## job_titleadministrative                                       -0.205710205
## yrs_in_field_exp.L                                             0.377711658
## yrs_in_field_exp.Q                                            -0.233402075
## yrs_in_field_exp.C                                            -0.088545759
## yrs_in_field_exp^4                                             0.045216396
## yrs_in_field_exp^5                                             0.018447533
## yrs_in_field_exp^6                                             0.007052612
## yrs_in_field_exp^7                                             0.026124725
## eduCollege degree                                             -0.084469220
## eduMaster's degree                                             0.083592208
## eduProfessional degree (MD, JD, etc.)                          0.257451751
## eduHigh School                                                -0.138477047
## eduSome college                                               -0.250643643
## eduPhD                                                         0.168759905
## genderMan                                                      0.135506883
## genderWoman                                                   -0.078966947
## genderNon-binary                                              -0.139860277
## indust_grpInformation Services                                -0.068711416
## indust_grpManufacturing and Construction                       0.135554932
## indust_grpHuman Services                                      -0.287923336
## indust_grpRaw Materials                                       -0.033094510
## add_comp_flg1                                                  0.220647701

We can see the coefficients for our logit equation, which are still very similar to those of the previous model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1761 1323
##       high  775 2979
##                                                
##                Accuracy : 0.6932               
##                  95% CI : (0.6821, 0.7041)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3704               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.6944               
##             Specificity : 0.6925               
##          Pos Pred Value : 0.5710               
##          Neg Pred Value : 0.7936               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2575               
##    Detection Prevalence : 0.4510               
##       Balanced Accuracy : 0.6934               
##                                                
##        'Positive' Class : low                  
## 

Now if we take a look at our predictions against the test set, we see an overall accuracy of 69.3%, a gain of 0.3% from our previous model. The 0.3% difference is not substantial, but it is the first increase in accuracy we’ve seen across our model iterations. Sensitivity is 69.4%, a 0.6% drop that is not so bad, especially because specificity rises about 1% to 69.3%, correcting back toward specificity after tuning for better sensitivity. Looking at balanced accuracy, we get 69.3%, an increase of 0.1%, which is almost equivalent to the balanced accuracy of our original model but with much more balance between sensitivity and specificity. This is likely about as good as our logistic regression iterations are going to get.

Balanced - XGBoost Decision Tree Ensemble

Our literature review indicated that a decision tree-based model could end up beating a logistic regression model for classifying a similar set of data, so it made sense to train a tree-based model and compare it to our logistic regression. XGBoost builds an ensemble of decision trees through gradient boosting, adding trees over repeated iterations so that each new tree corrects the errors of the ensemble so far. It is widely treated as the gold standard among tree-based models, which is why we chose it.

We use the same terms and preprocessing as the previous model on the balanced dataset, and allow caret to tune the xgboost model automatically through its default parameter grid. If we chose to, we could customize the learning rate, the number of boosting rounds, the maximum tree depth, and how strongly each new tree is weighted.
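For reference, a tuning grid covering the same combinations reported below could be supplied explicitly along these lines (a sketch; object names are placeholders carried over from the earlier sketches):

# The seven xgbTree tuning parameters caret expects in a custom grid.
xgb_grid <- expand.grid(nrounds          = c(50, 100, 150),
                        max_depth        = 1:3,
                        eta              = c(0.3, 0.4),
                        gamma            = 0,
                        colsample_bytree = c(0.6, 0.8),
                        min_child_weight = 1,
                        subsample        = c(0.5, 0.75, 1.0))

xgb_fit <- train(comp_class ~ ., data = balanced_train,
                 method = "xgbTree", metric = "ROC",
                 preProcess = c("center", "scale"),
                 tuneGrid = xgb_grid, trControl = ctrl)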

## eXtreme Gradient Boosting 
## 
## 11840 samples
##     6 predictor
##     2 classes: 'low', 'high' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9472, 9472, 9472, 9472, 9472 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  ROC        Sens     
##   0.3  1          0.6               0.50        50      0.7552816  0.7021959
##   0.3  1          0.6               0.50       100      0.7605146  0.6908784
##   0.3  1          0.6               0.50       150      0.7609143  0.6976351
##   0.3  1          0.6               0.75        50      0.7547668  0.7040541
##   0.3  1          0.6               0.75       100      0.7596938  0.6875000
##   0.3  1          0.6               0.75       150      0.7608657  0.6922297
##   0.3  1          0.6               1.00        50      0.7545677  0.6986486
##   0.3  1          0.6               1.00       100      0.7597536  0.6967905
##   0.3  1          0.6               1.00       150      0.7607919  0.6930743
##   0.3  1          0.8               0.50        50      0.7556303  0.6907095
##   0.3  1          0.8               0.50       100      0.7603698  0.6858108
##   0.3  1          0.8               0.50       150      0.7607135  0.6902027
##   0.3  1          0.8               0.75        50      0.7557945  0.6984797
##   0.3  1          0.8               0.75       100      0.7600420  0.6927365
##   0.3  1          0.8               0.75       150      0.7609292  0.6908784
##   0.3  1          0.8               1.00        50      0.7552184  0.7008446
##   0.3  1          0.8               1.00       100      0.7598273  0.6971284
##   0.3  1          0.8               1.00       150      0.7608261  0.6929054
##   0.3  2          0.6               0.50        50      0.7662898  0.7146959
##   0.3  2          0.6               0.50       100      0.7678814  0.7250000
##   0.3  2          0.6               0.50       150      0.7688095  0.7179054
##   0.3  2          0.6               0.75        50      0.7651401  0.7121622
##   0.3  2          0.6               0.75       100      0.7689747  0.7195946
##   0.3  2          0.6               0.75       150      0.7706271  0.7206081
##   0.3  2          0.6               1.00        50      0.7657800  0.7052365
##   0.3  2          0.6               1.00       100      0.7698523  0.7204392
##   0.3  2          0.6               1.00       150      0.7704921  0.7250000
##   0.3  2          0.8               0.50        50      0.7669818  0.7167230
##   0.3  2          0.8               0.50       100      0.7689539  0.7275338
##   0.3  2          0.8               0.50       150      0.7694311  0.7224662
##   0.3  2          0.8               0.75        50      0.7653162  0.7072635
##   0.3  2          0.8               0.75       100      0.7698374  0.7202703
##   0.3  2          0.8               0.75       150      0.7696274  0.7263514
##   0.3  2          0.8               1.00        50      0.7652494  0.7062500
##   0.3  2          0.8               1.00       100      0.7683999  0.7079392
##   0.3  2          0.8               1.00       150      0.7700722  0.7248311
##   0.3  3          0.6               0.50        50      0.7678943  0.7187500
##   0.3  3          0.6               0.50       100      0.7677630  0.7077703
##   0.3  3          0.6               0.50       150      0.7667262  0.7175676
##   0.3  3          0.6               0.75        50      0.7690694  0.7192568
##   0.3  3          0.6               0.75       100      0.7685436  0.7153716
##   0.3  3          0.6               0.75       150      0.7679639  0.7126689
##   0.3  3          0.6               1.00        50      0.7682852  0.7201014
##   0.3  3          0.6               1.00       100      0.7696784  0.7207770
##   0.3  3          0.6               1.00       150      0.7688302  0.7125000
##   0.3  3          0.8               0.50        50      0.7692224  0.7216216
##   0.3  3          0.8               0.50       100      0.7680949  0.7282095
##   0.3  3          0.8               0.50       150      0.7677084  0.7128378
##   0.3  3          0.8               0.75        50      0.7688031  0.7238176
##   0.3  3          0.8               0.75       100      0.7686538  0.7136824
##   0.3  3          0.8               0.75       150      0.7667385  0.7091216
##   0.3  3          0.8               1.00        50      0.7688489  0.7184122
##   0.3  3          0.8               1.00       100      0.7691022  0.7146959
##   0.3  3          0.8               1.00       150      0.7681458  0.7133446
##   0.4  1          0.6               0.50        50      0.7586489  0.6875000
##   0.4  1          0.6               0.50       100      0.7605271  0.7010135
##   0.4  1          0.6               0.50       150      0.7607509  0.6978041
##   0.4  1          0.6               0.75        50      0.7578403  0.7011824
##   0.4  1          0.6               0.75       100      0.7609543  0.6878378
##   0.4  1          0.6               0.75       150      0.7609404  0.6983108
##   0.4  1          0.6               1.00        50      0.7581749  0.7033784
##   0.4  1          0.6               1.00       100      0.7607537  0.6902027
##   0.4  1          0.6               1.00       150      0.7610006  0.6944257
##   0.4  1          0.8               0.50        50      0.7591057  0.6956081
##   0.4  1          0.8               0.50       100      0.7605198  0.6935811
##   0.4  1          0.8               0.50       150      0.7602740  0.6993243
##   0.4  1          0.8               0.75        50      0.7582482  0.7052365
##   0.4  1          0.8               0.75       100      0.7611683  0.6885135
##   0.4  1          0.8               0.75       150      0.7608301  0.6954392
##   0.4  1          0.8               1.00        50      0.7573436  0.7064189
##   0.4  1          0.8               1.00       100      0.7605611  0.6913851
##   0.4  1          0.8               1.00       150      0.7609553  0.6947635
##   0.4  2          0.6               0.50        50      0.7670589  0.7135135
##   0.4  2          0.6               0.50       100      0.7698872  0.7190878
##   0.4  2          0.6               0.50       150      0.7683040  0.7094595
##   0.4  2          0.6               0.75        50      0.7668609  0.7204392
##   0.4  2          0.6               0.75       100      0.7699119  0.7261824
##   0.4  2          0.6               0.75       150      0.7694118  0.7155405
##   0.4  2          0.6               1.00        50      0.7666529  0.7104730
##   0.4  2          0.6               1.00       100      0.7695504  0.7287162
##   0.4  2          0.6               1.00       150      0.7704656  0.7197635
##   0.4  2          0.8               0.50        50      0.7658190  0.7074324
##   0.4  2          0.8               0.50       100      0.7681632  0.7246622
##   0.4  2          0.8               0.50       150      0.7672664  0.7125000
##   0.4  2          0.8               0.75        50      0.7662365  0.7146959
##   0.4  2          0.8               0.75       100      0.7703771  0.7179054
##   0.4  2          0.8               0.75       150      0.7690212  0.7141892
##   0.4  2          0.8               1.00        50      0.7673982  0.7148649
##   0.4  2          0.8               1.00       100      0.7697914  0.7332770
##   0.4  2          0.8               1.00       150      0.7702436  0.7207770
##   0.4  3          0.6               0.50        50      0.7673364  0.7202703
##   0.4  3          0.6               0.50       100      0.7672434  0.7038851
##   0.4  3          0.6               0.50       150      0.7653909  0.7126689
##   0.4  3          0.6               0.75        50      0.7678757  0.7192568
##   0.4  3          0.6               0.75       100      0.7674875  0.7143581
##   0.4  3          0.6               0.75       150      0.7659122  0.7162162
##   0.4  3          0.6               1.00        50      0.7686855  0.7243243
##   0.4  3          0.6               1.00       100      0.7682789  0.7170608
##   0.4  3          0.6               1.00       150      0.7670825  0.7153716
##   0.4  3          0.8               0.50        50      0.7665067  0.7192568
##   0.4  3          0.8               0.50       100      0.7669026  0.7177365
##   0.4  3          0.8               0.50       150      0.7664904  0.7167230
##   0.4  3          0.8               0.75        50      0.7688089  0.7241554
##   0.4  3          0.8               0.75       100      0.7676189  0.7170608
##   0.4  3          0.8               0.75       150      0.7661925  0.7153716
##   0.4  3          0.8               1.00        50      0.7689255  0.7189189
##   0.4  3          0.8               1.00       100      0.7676264  0.7128378
##   0.4  3          0.8               1.00       150      0.7663954  0.7097973
##   Spec     
##   0.6608108
##   0.6858108
##   0.6756757
##   0.6547297
##   0.6866554
##   0.6815878
##   0.6601351
##   0.6805743
##   0.6814189
##   0.6717905
##   0.6868243
##   0.6847973
##   0.6626689
##   0.6844595
##   0.6817568
##   0.6560811
##   0.6800676
##   0.6819257
##   0.6777027
##   0.6655405
##   0.6699324
##   0.6785473
##   0.6692568
##   0.6679054
##   0.6822635
##   0.6711149
##   0.6677365
##   0.6726351
##   0.6660473
##   0.6680743
##   0.6792230
##   0.6672297
##   0.6579392
##   0.6827703
##   0.6812500
##   0.6663851
##   0.6711149
##   0.6777027
##   0.6719595
##   0.6660473
##   0.6766892
##   0.6736486
##   0.6668919
##   0.6667230
##   0.6751689
##   0.6646959
##   0.6625000
##   0.6736486
##   0.6648649
##   0.6731419
##   0.6684122
##   0.6731419
##   0.6717905
##   0.6734797
##   0.6836149
##   0.6731419
##   0.6783784
##   0.6702703
##   0.6854730
##   0.6755068
##   0.6603041
##   0.6858108
##   0.6783784
##   0.6677365
##   0.6802365
##   0.6741554
##   0.6591216
##   0.6834459
##   0.6780405
##   0.6581081
##   0.6841216
##   0.6785473
##   0.6765203
##   0.6701014
##   0.6780405
##   0.6690878
##   0.6631757
##   0.6714527
##   0.6748311
##   0.6594595
##   0.6675676
##   0.6809122
##   0.6657095
##   0.6758446
##   0.6746622
##   0.6682432
##   0.6763514
##   0.6702703
##   0.6621622
##   0.6695946
##   0.6636824
##   0.6810811
##   0.6722973
##   0.6633446
##   0.6677365
##   0.6648649
##   0.6645270
##   0.6707770
##   0.6719595
##   0.6652027
##   0.6694257
##   0.6722973
##   0.6643581
##   0.6729730
##   0.6711149
##   0.6716216
##   0.6775338
##   0.6766892
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 2, eta
##  = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 0.75.

We were able to find the highest training ROC yet with the optimally tuned parameters from our search at 0.771.

Visualizing how the ROC changes over boosting iterations for each hyperparameter combination is an interesting exercise: with our optimal parameters, the ROC increases roughly linearly with each iteration, which is not the case for most of the other combinations.

Unfortunately, due to the nature of XGBoost we cannot see coefficients for each term; there is no single equation at all. We can still see variable importance, and there are some interesting changes in magnitude, although we cannot tell whether a variable pushes the prediction toward being above or below the median income. The linear component of years in field experience is now almost four times as important as the next variable, working in the human services industry. This hints that the boost in importance of the years in field experience variable when switching to the elastic net model was justified.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  low high
##       low  1871 1460
##       high  665 2842
##                                                
##                Accuracy : 0.6892               
##                  95% CI : (0.6781, 0.7002)     
##     No Information Rate : 0.6291               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3743               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.7378               
##             Specificity : 0.6606               
##          Pos Pred Value : 0.5617               
##          Neg Pred Value : 0.8104               
##              Prevalence : 0.3709               
##          Detection Rate : 0.2736               
##    Detection Prevalence : 0.4871               
##       Balanced Accuracy : 0.6992               
##                                                
##        'Positive' Class : low                  
## 

Now if we take a look at our predictions against the test set, we see an overall accuracy of 68.9%, a slight drop of 0.4% from our previous model. In exchange, sensitivity rises to 73.8%, a gain of more than 4%, signaling that the XGBoost model leans back toward sensitivity, while specificity falls about 3% to 66.1%. Looking at balanced accuracy, we get 69.9%, an increase of 0.6%, which beats the balanced accuracy of our original model while keeping sensitivity and specificity far more balanced than that model did. This is our best model in everything but interpretability.


Select Models

Regression Models

With our regression models built, we want to see which one has the best performance for our purposes.

In this case both models have similar (low) performance, so we prefer the second model, which achieves similar predictions with fewer variables.

Classification Models

With our classification models built, we want to see which one has the best performance for our purposes. We are evaluating the models that we have built on a combination of interpretability, balanced accuracy, and the ratio between sensitivity and specificity.

evalm %>% 
  mutate(across(c(Accuracy:`Balanced Accuracy`), ~ round(as.numeric(.x), 4))) %>% 
  arrange(desc(`Balanced Accuracy`)) %>% 
  knitr::kable()
Model Accuracy Sensitivity Specificity Balanced Accuracy
XGBoost 0.6892 0.7378 0.6606 0.6992
Unbalanced Simple Logistic Regression 1 0.7331 0.5406 0.8466 0.6936
Elastic Net Logistic Regression 0.6932 0.6944 0.6925 0.6934
Balanced Simple Logistic Regression 0.6897 0.7003 0.6834 0.6919
Unbalanced Simple Logistic Regression 2 0.7197 0.5118 0.8422 0.6770

If we sort by balanced accuracy, XGBoost beats the rest of our models, which lends support to the assertion that tree-based models are well suited to classifying income data. Yet this model is disqualified from being anything more than a pure prediction tool: if we want to learn which variables contribute to having an income above or below the median, we would not be able to tell which direction a variable is pushing. That is why, within the same percentage point of balanced accuracy, we defer to the logistic regression models.

The next highest balanced accuracy belongs to the original unbalanced simple logistic regression model. However, balanced accuracy does not tell the whole story, even though it is calculated from sensitivity and specificity: that model's sensitivity is too low for it to be considered a credible predictor of the low-income class.

Therefore, we deem the best model to be our elastic net model trained on the balanced dataset. Its balanced accuracy is within a few ten-thousandths of the simple logistic regression model's, while its sensitivity and specificity are far better balanced.
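
The sensitivity/specificity balance behind this choice can be made explicit by adding a ratio column to the comparison table; a sketch reusing the evalm tibble from the appendix:

evalm %>% 
  mutate(
    across(c(Accuracy:`Balanced Accuracy`), ~ round(as.numeric(.x), 4)),
    `Sens/Spec Ratio` = round(Sensitivity / Specificity, 2)
  ) %>% 
  arrange(desc(`Balanced Accuracy`)) %>% 
  knitr::kable()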


Conclusion

For our linear models, we were able to glean interesting relationships between our data and total compensation. It is possible that transforming the target variable, for example with a natural log transformation, would have improved the results. A task for future exploration would be to compare similar data collected ten years ago and see how the coefficients differ, which could spark a narrative about changing relationships in the workplace, for example the relative gain or loss in economic standing of different states.
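
A minimal sketch of that transformation, refitting the reduced regression on a log scale using the sal_train split from the appendix (log1p is used defensively in case any total_comp is zero; the resulting fit statistics are on the log scale and not directly comparable to the raw-scale models):

lm_2_log <- lm(log1p(total_comp) ~ state + job_title + yrs_in_field_exp + edu + gender +
                 indust_grp + title_context_flg + income_context_flg + White + HispLatSp +
                 add_comp_flg, data = sal_train)
summary(lm_2_log)$adj.r.squared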

For our logistic models, given that we satisfied the model assumptions with the dataset we had, it is unlikely that model choice was the problem; we got close to the best results available for this dataset. The limitation lies in the dataset itself: there were not enough variables that could reliably separate above-median from below-median incomes. Perhaps if the survey had asked respondents how they believed their income compared to their peers', we could have made much more accurate predictions. We could also have extracted more value from the messier free-text columns such as job titles and the context fields, though the process would be painstaking.
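
As a sketch of the kind of further text processing we have in mind, the same tidytext approach used for job titles could be pointed at the income_context field before it is collapsed to a flag (assuming the sal_US data frame and the tidytext session from the appendix):

#Most frequent meaningful words in the free-text income_context field
sal_US |>
  filter(!is.na(income_context)) |>
  transmute(line = row_number(), text = income_context) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  count(word, sort = TRUE) |>
  head(10)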

The ideal way to extend this attempt at predicting incomes and income classification would be a new survey whose questions only allow pre-defined responses, so that each question carries as much predictive power as possible.


Code Appendix

library(tidyverse)
library(skimr)      #skim
library(gdata)      #trim
library(ggpubr)     #ggarrange
library(scales)     #formatting numbers in ggplot axis 
library(cowplot)    #density chart
library(outliers)   #scores
library(qdapTools)  #mtabulate
library(ggcorrplot)   #corrplot for categorical data
library(correlationfunnel)
library(usmap)
library(caret) #the all in one model fitting package
library(dplyr)
#Loading Data
sal_df <- as.data.frame(read.csv('https://raw.githubusercontent.com/dianaplunkett/621/main/Ask%20A%20Manager%20Salary%20Survey%202021%20(Responses).csv'))
#load in the per capita income data.
state_inc<-as.data.frame(read.csv("https://raw.githubusercontent.com/dianaplunkett/621/main/PersonalIncomePerCapita.csv"))
names(sal_df)<-c('timestamp','age','industry','job_title', 'title_context','annual_sal','add_comp','ccy','other_ccy','income_context','ctry','state','city','yrs_wrk_exp','yrs_in_field_exp','edu','gender', 'race')

names(state_inc)<-c('state','PIncPerCap')
### Data Preprocessing 
glimpse(sal_df)
#empty strings to NA
sal_df[sal_df == ''] <- NA
#make factors
sal_df |>
  mutate(
    across(
      c(age, yrs_wrk_exp, yrs_in_field_exp, edu, gender),
      as_factor
    )
  )  -> sal_df

sal_df |>
  dplyr::select(where(is.factor)) |>
  sapply(levels)
#order factors
sal_df |>
  mutate(
    age = ordered(age, 
                  c("under 18", 
                    "18-24", 
                    "25-34", 
                    "35-44", 
                    "45-54", 
                    "55-64", 
                    "65 or over")),
    yrs_wrk_exp = ordered(yrs_wrk_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    yrs_in_field_exp = ordered(yrs_in_field_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    edu = ordered(edu, 
                    c("High School", 
                      "Some college", 
                      "College degree", 
                      "Master's degree", 
                      "PhD", 
                      "Professional degree (MD, JD, etc.)")),
  ) -> sal_df
#create a new df sal_US with only rows where state != NA
sal_US <- sal_df[!is.na(sal_df$state),]
#remove the ctry column, all rows are now US
sal_US = subset(sal_US, select = -c(ctry) )
#unique state shows 132 different values, just showing tail for clarity. 
tail(unique(sal_US$state))
#how many rows have a ',', eg multiple states
length(which(grepl(',', sal_US$state)))
#delete the rows with multiple states
sal_US <- sal_US[!grepl(',', sal_US$state),]
#merge rows 
sal_US <- merge(sal_US, state_inc, by ='state', all.x=TRUE)

head(sal_US |> distinct(state,PIncPerCap))
#sum the annual_sal and add_comp
sal_US$total_comp <- rowSums(sal_US[, c('annual_sal','add_comp')], na.rm=TRUE)

#create the hi_comp field
sal_US <- sal_US |> mutate(hi_comp=if_else(total_comp>PIncPerCap,1,0))
sal_US$hi_comp <- as_factor(sal_US$hi_comp)

head(sal_US |> select (annual_sal, add_comp,total_comp,PIncPerCap, hi_comp))
##dmp added 12/2
theColors <- c('#6aaec8','#52985e')

ggplot(sal_US, aes(x=hi_comp, fill=hi_comp)) +
  geom_bar (aes(y = (after_stat(count))/sum(after_stat(count)))) +
  ylab("Percent")+
  scale_fill_manual(values=theColors)+
    theme_classic()
#Data Skimming
skim(sal_US)
head(unique(sal_US$industry))
head(sal_US$job_title[is.na(sal_US$industry)])
head(unique(sal_US$title_context))
table(sal_US$ccy)
tail (unique(sal_US$other_ccy))
head(unique(sal_US$income_context))
head(unique(sal_US$city))
#### Demographic variables
unique(sal_US[(!grepl(',', sal_US$race)), c('race')])
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
pa <- sal_US |> 
    ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
pb <- sal_US |> 
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
pc <- sal_US |> 
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number_si()) +
    theme_classic()
pd <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number_si())+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(pa,pb,pc,pd)
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
p1 <- sal_US |> 
    filter(annual_sal < 150000) |>
  ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
p2 <- sal_US |> 
    filter(add_comp < 150000) |>
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number_si())+
    theme_classic()
p3 <- sal_US |> 
    filter(total_comp < 150000) |>
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number_si()) +
    theme_classic()
p4 <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number_si())+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(p1,p2,p3,p4)
theColors <- c('#6aaec8','#52985e')

p5<-ggplot(sal_US, aes(fill=hi_comp, x=age), axis.title.y=element_blank()) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Age")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p6<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_wrk_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years Work Exp")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p7<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_in_field_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years In Field")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p8<-ggplot(sal_US, aes(fill=hi_comp, x=edu)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Education")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p9<-ggplot(sal_US, aes(fill=hi_comp, x=gender)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Gender")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank(), legend.title=element_text(size=8))

ggarrange(p5,p6,p7,p8,p9, nrow=3, ncol = 2)
# Target Amount Breakdown
# density plots
multdens <- function(df, var, target) {
  
  df |> 
    dplyr::select(annual_sal, add_comp, total_comp, hi_comp) |> 
    filter(!is.na(.data[[var]])) |> filter(total_comp < 150000) |>
    ggplot(aes(.data[[var]], fill = .data[[target]], color = .data[[target]])) +
      geom_density(alpha = 0.4) +
    scale_fill_manual(values=theColors)+
    scale_color_manual(values=theColors)+
      labs(y = NULL) +
      theme_minimal(base_size = 6) +
      theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}

num_col <- sal_US |>
  dplyr::select(annual_sal, add_comp, total_comp) |>
  colnames()

dens <- lapply(num_col, multdens, df = sal_US, target = "hi_comp")

plot_grid(plotlist = dens[1:5], label_x = NULL, ncol=1)
# map of per capita personal income 
map<-plot_usmap(data= state_inc, values ='PIncPerCap', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Personal Income Per Capita', label = scales::comma) + 
  theme(legend.position = "right")

# Set label font size
map$layers[[2]]$aes_params$size <- 1.5
print(map)
#first we want to gather the data to map
occ<-as.data.frame(table(sal_US$state))
names(occ)<-c('state','responses')
#map of responses by state
map_2<-plot_usmap(data= occ, values ='responses', color = '#52985e', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#03688f',name = 'Responses', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_2$layers[[2]]$aes_params$size <- 1.5
print(map_2)
#first we want to gather the data to map
avg<-sal_US |>
  group_by(state) |>
  summarise_at(vars(total_comp), list(name = mean))
names(avg)<-c('state','avg_comp')
#map of responses by state
map_3<-plot_usmap(data= avg, values ='avg_comp', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Mean Total Comp', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_3$layers[[2]]$aes_params$size <- 1.5
print(map_3)
# Data Preparation
# new data frame for data prep 
sal_US1 <- sal_US
#correct the others that are really USD
sal_US1 <- sal_US1 |>
  mutate (ccy = ifelse(ccy=='Other' & (other_ccy=='USD' | other_ccy == "US Dollar" | other_ccy == "American Dollars"),'USD', ccy))

#Keep only the rows where ccy=USD
sal_US1 <- sal_US1[sal_US1$ccy =='USD',]

table(sal_US1$ccy)
#remove ccy and other_ccy columns as all rows are now USD
sal_US1 = subset(sal_US1, select = -c(ccy, other_ccy))
#remove city column
sal_US1 = subset(sal_US1, select = -c(city))
#make indust_grp field
sal_US1 <- sal_US1 %>% mutate (indust_grp = 
            case_when (
#Raw Materials
              grepl('Forest', industry) ~ 'Raw Materials',
              grepl('Mining', industry) ~ 'Raw Materials',
              grepl('Oil', industry) ~ 'Raw Materials',
              grepl('oil', industry) ~ 'Raw Materials',
              grepl('Energy', industry) ~ 'Raw Materials',
              grepl('energy', industry) ~ 'Raw Materials',
              grepl('Petroleum', industry) ~ 'Raw Materials',
              grepl('Agriculture', industry) ~ 'Raw Materials',
              grepl('Fisher', industry) ~ 'Raw Materials',
              
#Manufacturing and Construction             
              grepl('Construction', industry) ~ 'Manufacturing and Construction',
              grepl('Contracting', industry) ~ 'Manufacturing and Construction',
              grepl('Manufac', industry) ~ 'Manufacturing and Construction',
              grepl('MANUFAC', industry) ~ 'Manufacturing and Construction',
              grepl('manufac', industry) ~ 'Manufacturing and Construction',
              grepl('Mfg', industry) ~ 'Manufacturing and Construction',
              grepl('coat', industry) ~ 'Manufacturing and Construction',
              grepl('Wine', industry) ~ 'Manufacturing and Construction',              
              grepl('Packaged Goods', industry) ~ 'Manufacturing and Construction',
            grepl('packaged goods', industry) ~ 'Manufacturing and Construction',
            grepl('Goods', industry) ~ 'Manufacturing and Construction',
            grepl('Consumer', industry) ~ 'Manufacturing and Construction',
            grepl('consumer', industry) ~ 'Manufacturing and Construction',
            grepl('Commercial furniture', industry) ~ 'Manufacturing and Construction',
            grepl('Industrial', industry) ~ 'Manufacturing and Construction',
            grepl('Craft Beer Industry', industry) ~ 'Manufacturing and Construction',
            grepl('Beverage', industry) ~ 'Manufacturing and Construction',
            grepl('Brew', industry) ~ 'Manufacturing and Construction',
            grepl('Chemicals', industry) ~ 'Manufacturing and Construction',
            grepl('Automotive', industry) ~ 'Manufacturing and Construction',
#Information Services             
              grepl('Educ', industry) ~ 'Information Services',
              grepl('educ', industry) ~ 'Information Services',
              grepl('acad', industry) ~ 'Information Services',
              grepl('learn', industry) ~ 'Information Services',
              grepl('School', industry) ~ 'Information Services',
              grepl('Acad', industry) ~ 'Information Services',
              grepl('College', industry) ~ 'Information Services',
              grepl('Tech', industry) ~ 'Information Services',
              grepl('tech', industry) ~ 'Information Services',
              grepl('Anal', industry) ~ 'Information Services',
              grepl('development', industry) ~ 'Information Services',

              grepl('Title IX', industry) ~ 'Information Services',
              grepl('Publication', industry) ~ 'Information Services',              
              grepl('edit', industry) ~ 'Information Services',
              grepl('Publish', industry) ~ 'Information Services',
              grepl('publish', industry) ~ 'Information Services',
              grepl('Ecology', industry) ~ 'Information Services',
              grepl('R&D', industry) ~ 'Information Services',
              grepl('Info', industry) ~ 'Information Services',
              grepl('Librar', industry) ~ 'Information Services',
              grepl('LIbrar', industry) ~ 'Information Services',
              grepl('librar', industry) ~ 'Information Services',
              grepl('LIBRAR', industry) ~ 'Information Services',

              grepl('Archivist', industry) ~ 'Information Services',
              grepl('Reporter', industry) ~ 'Information Services',
              grepl('Engineer', industry) ~ 'Information Services',
              grepl('engineer', industry) ~ 'Information Services',
              grepl('Copywriter', industry) ~ 'Information Services',
              grepl('Develop', industry) ~ 'Information Services',
              grepl('Edit', industry) ~ 'Information Services',
              grepl('Design', industry) ~ 'Information Services',
              grepl('Teach', industry) ~ 'Information Services',
              grepl('Train', industry) ~ 'Information Services',
              grepl('Instruct', industry) ~ 'Information Services',
              grepl('Research', industry) ~ 'Information Services',
              grepl('research', industry) ~ 'Information Services',
              grepl('Media', industry) ~ 'Information Services',
              grepl('STEM', industry) ~ 'Information Services',
              grepl('Think tank', industry) ~ 'Information Services',
              grepl('Marketing', industry) ~ 'Information Services',
              grepl('marketing', industry) ~ 'Information Services',
              grepl('Communica', industry) ~ 'Information Services',
              grepl('market research', industry) ~ 'Information Services',
              grepl('Data', industry) ~ 'Information Services',
              grepl('data', industry) ~ 'Information Services',
              grepl('content', industry) ~ 'Information Services',
              grepl('Content', industry) ~ 'Information Services',
              grepl('CPG', industry) ~ 'Information Services',
              grepl('IT', industry) ~ 'Information Services',
              grepl('Scien', industry) ~ 'Information Services',
              grepl('science', industry) ~ 'Information Services',
              grepl('Video', industry) ~ 'Information Services',
              grepl('Internet', industry) ~ 'Information Services',
            grepl('Archaeology', industry) ~ 'Information Services',
              grepl('Bio', industry) ~ 'Information Services',
              grepl('chemistry', industry) ~ 'Information Services',
              grepl('student', industry) ~ 'Information Services',
              grepl('Student', industry) ~ 'Information Services',
              grepl('Graduate assistant', industry) ~ 'Information Services',
              grepl('It security', industry) ~ 'Information Services',
              grepl('lab', industry) ~ 'Information Services',
              grepl('Print', industry) ~ 'Information Services',
            grepl('Chemistry', industry) ~ 'Information Services',
            grepl('Gaming', industry) ~ 'Information Services',
            grepl('Software', industry) ~ 'Information Services',
            grepl('Journalism', industry) ~ 'Information Services',
            grepl('journalism', industry) ~ 'Information Services',
            grepl('Work-Study', industry) ~ 'Information Services',
            grepl('Geospatial', industry) ~ 'Information Services',
            grepl('Virtual reality', industry) ~ 'Information Services',
            grepl('Survey', industry) ~ 'Information Services',
            
            grepl('PR', industry) ~ 'Information Services',
            grepl('Quality Assurance', industry) ~ 'Information Services',
#Human Services              
              grepl('Gov', industry) ~ 'Human Services',
              grepl('gov', industry) ~ 'Human Services',
              grepl('Volun', industry) ~ 'Human Services',
              grepl('Policy', industry) ~ 'Human Services',
              grepl('City', industry) ~ 'Human Services',
              grepl('Federal', industry) ~ 'Human Services',
              grepl('federal', industry) ~ 'Human Services',
              grepl('State', industry) ~ 'Human Services',
              grepl('Grant', industry) ~ 'Human Services',

              grepl('Nonprofit', industry) ~ 'Human Services',
              grepl('Non-profit', industry) ~ 'Human Services',
              grepl('Workforce', industry) ~ 'Human Services',
              grepl('Social Work', industry) ~ 'Human Services',
              grepl('Fundrais', industry) ~ 'Human Services',
              grepl('Environ', industry) ~ 'Human Services',
              grepl('environ', industry) ~ 'Human Services',
              grepl('Natural', industry) ~ 'Human Services',
              grepl('not for profit', industry) ~ 'Human Services',
              grepl('Non Profit', industry) ~ 'Human Services',
              grepl('Non profit', industry) ~ 'Human Services',
              grepl('Military', industry) ~ 'Human Services',
              grepl('Ministry', industry) ~ 'Human Services',
              grepl('Clergy', industry) ~ 'Human Services',
              grepl('Chaplain', industry) ~ 'Human Services',
              grepl('Foundation', industry) ~ 'Human Services',
              grepl('Religio', industry) ~ 'Human Services',
              grepl('Heritage', industry) ~ 'Human Services',
              grepl('AmeriCorps', industry) ~ 'Human Services',
              grepl('Intelligence', industry) ~ 'Human Services',
              grepl('Museum', industry) ~ 'Human Services',
              grepl('Zoo', industry) ~ 'Human Services',
              grepl('Philanthropy', industry) ~ 'Human Services',
              grepl('International development', industry) ~ 'Human Services',
              grepl('Historic Preservation', industry) ~ 'Human Services',
              grepl('Maritime', industry) ~ 'Human Services',
              grepl('Church', industry) ~ 'Human Services',
              grepl('Association', industry) ~ 'Human Services',
              grepl('association', industry) ~ 'Human Services',
              grepl('Politic', industry) ~ 'Human Services',
              grepl('politic', industry) ~ 'Human Services',
              grepl('Defense', industry) ~ 'Human Services',
              grepl('union', industry) ~ 'Human Services',
              grepl('Union', industry) ~ 'Human Services',
              grepl('Labor', industry) ~ 'Human Services',

              grepl('Funding', industry) ~ 'Human Services',
              grepl('Performing Arts', industry) ~ 'Human Services',
              grepl('Hosp', industry) ~ 'Human Services',
              grepl('parent', industry) ~ 'Human Services',

#Service              
              grepl('Law', industry) ~ 'Service',
              grepl('Legal', industry) ~ 'Service',
              grepl('attorney', industry) ~ 'Service',
              grepl('Psychol', industry) ~ 'Service',
              grepl('Counseling', industry) ~ 'Service',
              grepl('Doc', industry) ~ 'Service',
              grepl('Forensics', industry) ~ 'Service',
              grepl('Toxicology', industry) ~ 'Service',
              grepl('Medical', industry) ~ 'Service',
              grepl('Manage', industry) ~ 'Service',
              grepl('manage', industry) ~ 'Service',
              grepl('Admin', industry) ~ 'Service',
              grepl('Reception', industry) ~ 'Service',
              grepl('design', industry) ~ 'Service',
              grepl('Sale', industry) ~ 'Service',
              grepl('sale', industry) ~ 'Service',
              grepl('Buyer', industry) ~ 'Service',
              grepl('Bank', industry) ~ 'Service',
              grepl('finance', industry) ~ 'Service',
              grepl('trading', industry) ~ 'Service',
              grepl('Archi', industry) ~ 'Service',
              grepl('archi', industry) ~ 'Service',
              grepl('product development', industry) ~ 'Service',
              grepl('Equipment', industry) ~ 'Service',
              grepl('Health', industry) ~ 'Service',
              grepl('health', industry) ~ 'Service',
              grepl('Retail', industry) ~ 'Service',
              grepl('Tour', industry) ~ 'Service',
              grepl('Screening', industry) ~ 'Service',
              grepl('Insur', industry) ~ 'Service',
              grepl('Hospitality', industry) ~ 'Service',
              grepl('Consult', industry) ~ 'Service',
              grepl('consult', industry) ~ 'Service',
              grepl('Rideshare', industry) ~ 'Service',
              grepl('Transp', industry) ~ 'Service',
              grepl('Logist', industry) ~ 'Service',
              grepl('Distribution', industry) ~ 'Service',
              grepl('Deliver', industry) ~ 'Service',
              grepl('deliver', industry) ~ 'Service',
              grepl('sustainability', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('Planning', industry) ~ 'Service',      
              grepl('Recruit', industry) ~ 'Service',
              grepl('HR', industry) ~ 'Service',
              grepl('Equity', industry) ~ 'Service',
              grepl('Human', industry) ~ 'Service',
              grepl('Childcare', industry) ~ 'Service',
              grepl('Child care', industry) ~ 'Service',
              grepl('Child Care', industry) ~ 'Service',
              grepl('Telecom', industry) ~ 'Service',
              grepl('parking', industry) ~ 'Service',
              grepl('restoration', industry) ~ 'Service',
              grepl('Homemaker', industry) ~ 'Service',
              grepl('Caregiver', industry) ~ 'Service',

              grepl('Landscap', industry) ~ 'Service',
              grepl('landscap', industry) ~ 'Service',
              grepl('Repair', industry) ~ 'Service',
              grepl('repair', industry) ~ 'Service',
              grepl('Facilities', industry) ~ 'Service',

              grepl('engraving', industry) ~ 'Service',
              grepl('Call center', industry) ~ 'Service',
              grepl('Wholesale', industry) ~ 'Service',
              grepl('wholesale', industry) ~ 'Service',
              grepl('CALL', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('pharma', industry) ~ 'Service',
              grepl('Regulator', industry) ~ 'Service',
              grepl('Entertain', industry) ~ 'Service',
              grepl('Sport', industry) ~ 'Service',
              grepl('Music', industry) ~ 'Service',
              grepl('music', industry) ~ 'Service',
              grepl('I work for Indeed.com', industry) ~ 'Service',
              grepl('Interpret', industry) ~ 'Service',
              grepl('Funeral', industry) ~ 'Service',
              grepl('funeral', industry) ~ 'Service',
              grepl('Translation', industry) ~ 'Service',
              grepl('Concrete', industry) ~ 'Service',
              grepl('ecomm', industry) ~ 'Service',
              grepl('E-comm', industry) ~ 'Service',
              grepl('E-Comm', industry) ~ 'Service',
              grepl('e-comm', industry) ~ 'Service',
              grepl('E comm', industry) ~ 'Service',
              grepl('EComm', industry) ~ 'Service',
              grepl('Ecomm', industry) ~ 'Service',
              grepl('Pest', industry) ~ 'Service',
              grepl('Service', industry) ~ 'Service',
              grepl('service', industry) ~ 'Service',
              grepl('Cleaning', industry) ~ 'Service',
              grepl('Saas', industry) ~ 'Service',
              grepl('compli', industry) ~ 'Service',
              grepl('Assisting', industry) ~ 'Service',

              grepl('Apparel', industry) ~ 'Service',
              grepl('appraisal', industry) ~ 'Service',
              grepl('Compli', industry) ~ 'Service',
              grepl('Business', industry) ~ 'Service',
              grepl('Strategy', industry) ~ 'Service',
              grepl('Eap', industry) ~ 'Service',
              grepl('Investing', industry) ~ 'Service',
              grepl('Storage', industry) ~ 'Service',
              grepl('Supply', industry) ~ 'Service',
              grepl('Procur', industry) ~ 'Service',
              grepl('Airline', industry) ~ 'Service',
              grepl('Aerospace', industry) ~ 'Service',
              grepl('aviat', industry) ~ 'Service',
              grepl('Aviat', industry) ~ 'Service',
              grepl('Importing', industry) ~ 'Service',
              grepl('Staffing', industry) ~ 'Service',
              grepl('Cannabis', industry) ~ 'Service',
              grepl('warehous', industry) ~ 'Service',
              grepl('Warehous', industry) ~ 'Service',
              grepl('Rental', industry) ~ 'Service',
              grepl('Food', industry) ~ 'Service',
              grepl('food', industry) ~ 'Service',
              grepl('cook', industry) ~ 'Service',
              grepl('Beauty', industry) ~ 'Service',
              grepl('safety', industry) ~ 'Service',
              grepl('protect', industry) ~ 'Service',
  
              grepl('Operations', industry) ~ 'Service',
              grepl('Real Estate', industry) ~ 'Service',
              grepl('Real estate', industry) ~ 'Service',
              grepl('real estate', industry) ~ 'Service',
              grepl('Fitness', industry) ~ 'Service',
              grepl('Gym', industry) ~ 'Service',
              grepl('accessibility', industry) ~ 'Service',
              grepl('Accessibility', industry) ~ 'Service',
              grepl('Executive', industry) ~ 'Service',
              grepl('fitness', industry) ~ 'Service',
              grepl('Security', industry) ~ 'Service',
              grepl('Per Sit', industry) ~ 'Service',
              grepl('Pet', industry) ~ 'Service',
              grepl('Vet', industry) ~ 'Service',
              grepl('Animal', industry) ~ 'Service',
              grepl('restaurant', industry) ~ 'Service',
              grepl('Restaurant', industry) ~ 'Service',
              grepl('Auction', industry) ~ 'Service',
              grepl('Mortgage', industry) ~ 'Service',
              grepl('mortgage', industry) ~ 'Service',
              grepl('Tax', industry) ~ 'Service',
              grepl('Film', industry) ~ 'Service',

              grepl('Travel', industry) ~ 'Service',
              grepl('Gambl', industry) ~ 'Service',
              grepl('office', industry) ~ 'Service',
              grepl('Office', industry) ~ 'Service',
              grepl('recycling', industry) ~ 'Service',
              grepl('Fashion', industry) ~ 'Service',
              grepl('Finance', industry) ~ 'Service',
              grepl('coach', industry) ~ 'Service',
              grepl('Corporate', industry) ~ 'Service',
              grepl('camp', industry) ~ 'Service',            
              grepl('Cosmetology', industry) ~ 'Service',
              grepl('Purchasing', industry) ~ 'Service',
              grepl('Printing', industry) ~ 'Service',
              grepl('Global Mobility', industry) ~ 'Service',
              grepl('professional', industry) ~ 'Service',
              grepl('Horticulture', industry) ~ 'Service',
              grepl('Maintenance', industry) ~ 'Service',
              grepl('Janitorial', industry) ~ 'Service'
    ))
table(sal_US1$indust_grp)

sal_US1 <- sal_US1[!is.na(sal_US1$indust_grp),]
sal_US1 = subset(sal_US1, select = -c(industry))
#create the title_context_flg field
sal_US1 <- sal_US1 |>
  mutate(title_context_flg=if_else(!is.na(title_context),1,0))
sal_US1$title_context_flg <- as_factor(sal_US1$title_context_flg)
#create the income_context_flg field
sal_US1 <- sal_US1 |>
  mutate(income_context_flg=if_else(!is.na(income_context),1,0))
sal_US1$income_context_flg <- as_factor(sal_US1$income_context_flg)

sal_US1 = subset(sal_US1, select = -c(income_context, title_context))
table(sal_US1$title_context_flg)
table(sal_US1$income_context_flg)
sal_US1<-sal_US1[rowSums(is.na(sal_US1[, c("race", "edu", "gender")])) < 3,]
sal_US1 <- sal_US1 |>
  mutate(gender=if_else(grepl('refer', gender) | is.na(gender),"Prefer not to answer",gender))

unique(sal_US1$gender)
sal_US1 <- sal_US1 |>
  mutate(edu=if_else(is.na(edu),"Prefer not to answer",edu))

unique(sal_US1$edu)
sal_US1 <- sal_US1 |>
  mutate(race=if_else(is.na(race),"Another or prefer not to answer",race))

sal_US1 <- sal_US1 |>
  mutate(White=if_else(grepl('White', race),1,0)) |>
  mutate(AsianOrAsianAm=if_else(grepl('Asian', race),1,0)) |>
  mutate(NatAmOrAlaskaNat=if_else(grepl('Native', race),1,0)) |>
  mutate(BlackOrAficanAm=if_else(grepl('Black', race),1,0)) |>
  mutate(MiddleEorNAfrican=if_else(grepl('Middle', race),1,0))|>
  mutate(PreferNot=if_else(grepl('prefer', race),1,0)) |>
  mutate(HispLatSp=if_else(grepl('Hisp', race),1,0))

sal_US1 = subset(sal_US1, select = -c(race))
##NEW dmp 12/2
p10<-ggplot(filter(sal_US1, White==1), 
            aes(fill=hi_comp, x=White)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank()) 

p11<-ggplot(filter(sal_US1, AsianOrAsianAm==1), 
            aes(fill=hi_comp, x=AsianOrAsianAm)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank()) 

p12<-ggplot(filter(sal_US1, NatAmOrAlaskaNat==1), 
            aes(fill=hi_comp, x=NatAmOrAlaskaNat)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p13<-ggplot(filter(sal_US1, BlackOrAficanAm==1), 
            aes(fill=hi_comp, x=BlackOrAficanAm)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p14<-ggplot(filter(sal_US1, MiddleEorNAfrican==1), 
            aes(fill=hi_comp, x=MiddleEorNAfrican)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p15<-ggplot(filter(sal_US1, HispLatSp==1), 
            aes(fill=hi_comp, x=HispLatSp)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          legend.position="none", axis.text.y=element_blank())

p16<-ggplot(filter(sal_US1, PreferNot==1), 
            aes(fill=hi_comp, x=PreferNot)) + 
    geom_bar(position="fill", stat="count")+
    coord_flip()+
    scale_fill_manual(values=theColors)+
    ylab("percent")+
    theme(plot.title = element_text(size = 8),
          axis.title = element_text(size = 8),
          axis.text.y=element_blank())
ggarrange(p10,p11,p12,p13,p14,p15,p16, nrow=3, ncol = 3)
#create the add_comp_flg field
sal_US1 <- sal_US1 |>
  mutate(add_comp_flg=if_else(!is.na(add_comp),1,0))
sal_US1$add_comp_flg <- as_factor(sal_US1$add_comp_flg)
#remove add_comp
sal_US1 = subset(sal_US1, select = -c(add_comp))
table(sal_US1$add_comp_flg)
library(tidytext)
library(textstem)

# We begin by extending the stop word dictionary with tokens from our dataset that carry no meaning on their own (single digits and Roman numeral fragments).

stop_words %>% 
  add_row(word= as.character(1:5), lexicon = "group_5") %>%
  add_row(word= c("i","I","ii","II","iii","III"), lexicon = "group_5") -> stop_words

# Then we tokenize the job titles so we can remove the stop words and process them further.

text_df <- tibble(line = 1:length(sal_US1$job_title), text = sal_US1$job_title) %>% 
  unnest_tokens(word,text) %>%
  anti_join(stop_words)

# We lemmatize each word used in the job titles to better standardize their meaning.

text_df$word <- lemmatize_words(text_df$word)

# Below we find the words that occur 50 or more times and use them as a single indicator in place of the raw job title. We keep only the lowest-count matching word for each row of the job title column in order to maximize variability for prediction.

text_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n >= 50) -> top_role

text_df %>% 
  inner_join(top_role, join_by(word == word)) %>% 
  arrange(n) %>% 
  distinct(line,.keep_all = TRUE) -> text_df

sal_US1 %>%
  mutate(Rownumber = row_number()) %>% 
  left_join(text_df, join_by(Rownumber == line )) %>%
  mutate(job_title = word) %>% 
  mutate(job_title = if_else(is.na(job_title), "other", job_title)) %>% 
  select(-Rownumber, -word, -n) -> sal_US1
#remove timestamp
sal_US1 = subset(sal_US1, select = -c(timestamp))
#Outliers
#create new variable to flag potential outliers
sal_US$annual_salOUT<-scores(sal_US$annual_sal, type='iqr', lim=1.5)

sal_US$total_compOUT<-scores(sal_US$total_comp, type='iqr', lim=1.5)

#using mtabulate from the qdapTools package to tabulate the values of all the OUT flag variables
t(mtabulate(sal_US[grep('OUT',names(sal_US))]))
# Correlation Funnel
sal_US1 %>%
  na.omit() |>
  binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) |> 
  correlate(hi_comp__1) |> 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
# Correlation Funnel
sal_US1 %>%
  na.omit() %>% 
  select(-hi_comp, -annual_sal) %>% 
  binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>%
  correlate(total_comp__120000_Inf) %>% 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
sal_US1 %>% 
  mutate(job_title = if_else(job_title %in% c("engineer","software","administrative","director","librarian"), job_title, "other")) -> sal_US1

sal_US1 %>% 
  pull(job_title) %>% 
  unique()
set.seed(1234)

sal_US2 <- sal_US1 %>% 
  mutate(
    hi_comp = relevel(as_factor(if_else(hi_comp == 1, "high","low")), ref = "low"),
    state = relevel(as_factor(state), ref="District of Columbia"),
    job_title = relevel(as_factor(job_title), ref="other"),
    edu = relevel(as_factor(edu), ref="Prefer not to answer"),
    gender = relevel(as_factor(gender), ref="Prefer not to answer"),
    indust_grp = relevel(as_factor(indust_grp), ref="Service")
  )

train <- createDataPartition(sal_US2$hi_comp,
                             p = 0.7,
                             list = FALSE)

sal_train <- sal_US2[ train,] 
sal_test  <- sal_US2[-train,]

dim(sal_train)
lm_0 <- lm(total_comp ~ . - hi_comp - annual_sal, data = sal_train)
lm_1 <- step(lm_0, direction = "backward")
summary(lm_1)
lm_2 <- lm(total_comp ~ state + job_title + yrs_in_field_exp + edu + gender + indust_grp + title_context_flg + income_context_flg + White + HispLatSp + add_comp_flg, data = sal_train)

summary(lm_2)
ctrl <- trainControl(
  method = "cv", 
  number = 5,
  summaryFunction = twoClassSummary, 
  classProbs = TRUE, 
  savePredictions = T
)

glm1 <- train(hi_comp ~ . - annual_sal -PIncPerCap - total_comp, data = sal_train, 
                          method = 'glm',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC")

glm1
resid_dat <- broom::augment(glm1$finalModel) %>% 
  mutate(index = 1:n())

ggplot(resid_dat, aes(index, .std.resid)) + 
  geom_point(aes(color = .outcome), alpha = .5) +
  theme_minimal()
plot(varImp(glm1), top=20)
par(mfrow=c(2,2))
plot(glm1$finalModel)
(g1c <- glm1 %>% 
  predict(sal_test) %>% 
  confusionMatrix(sal_test$hi_comp) )

evalm <- as_tibble(rbind(c("Model"="Unbalanced Simple Logistic Regression 1",g1c$overall["Accuracy"],g1c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
glm2 <- train(hi_comp ~ job_title + yrs_in_field_exp + edu + gender + indust_grp + add_comp_flg, data = sal_train, 
                          method = 'glm',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC")

glm2
resid_dat <- broom::augment(glm2$finalModel) %>% 
  mutate(index = 1:n())

ggplot(resid_dat, aes(index, .std.resid)) + 
  geom_point(aes(color = .outcome), alpha = .5) +
  theme_minimal()
plot(varImp(glm2), top=20)
summary(glm2)
par(mfrow=c(2,2))
plot(glm2$finalModel)
(g2c <- glm2 |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))

evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="Unbalanced Simple Logistic Regression 2",g2c$overall["Accuracy"],g2c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )
sal_train_bal <- UBL::RandUnderClassif(hi_comp ~ ., sal_train)

sal_train_bal %>% 
  pull(hi_comp) %>% 
  table()
glm3 <- train(hi_comp ~ job_title + yrs_in_field_exp + edu + gender + indust_grp + add_comp_flg, data = sal_train_bal, 
                          method = 'glm',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC")

glm3
resid_dat <- broom::augment(glm3$finalModel) %>% 
  mutate(index = 1:n())

ggplot(resid_dat, aes(index, .std.resid)) + 
  geom_point(aes(color = .outcome), alpha = .5) +
  theme_minimal()
sal_train_bal <- sal_train_bal %>% sample_n(nrow(.))

glm3 <- train(hi_comp ~ job_title + yrs_in_field_exp + edu + gender + indust_grp + add_comp_flg, data = sal_train_bal, 
                          method = 'glm',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC")

glm3
resid_dat <- broom::augment(glm3$finalModel) %>% 
  mutate(index = 1:n())

ggplot(resid_dat, aes(index, .std.resid)) + 
  geom_point(aes(color = .outcome), alpha = .5) +
  theme_minimal()
plot(varImp(glm3), top=20)
summary(glm3)
par(mfrow=c(2,2))
plot(glm3$finalModel)
(g3c <- glm3 |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))

evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="Balanced Simple Logistic Regression",g3c$overall["Accuracy"],g3c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )
glm4 <- train(hi_comp ~ job_title + yrs_in_field_exp + edu + gender + indust_grp + add_comp_flg, data = sal_train_bal, 
                          method = 'glmnet',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC",
                          tuneGrid=expand.grid(
                          .alpha=seq(0, 1, by = 0.2),
                          .lambda=seq(0, 1, by = 0.2)))

glm4
plot(glm4)
plot(varImp(glm4), top=20)
lambda_use <- min(glm4$finalModel$lambda[glm4$finalModel$lambda >= glm4$bestTune$lambda])
position <- which(glm4$finalModel$lambda == lambda_use)
data.frame(coef(glm4$finalModel)[, position])
(g4c <- glm4 |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))

evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="Elastic Net Logistic Regression",g4c$overall["Accuracy"],g4c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )
train_control <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE,
  allowParallel = FALSE 
)

xgb <- train(hi_comp ~ job_title + yrs_in_field_exp + edu + gender + indust_grp + add_comp_flg, data = sal_train_bal, 
                          method = 'xgbTree',
                          preProcess = c('center', 'scale'),
                          trControl = ctrl,
                          metric = "ROC",
                          verbosity = 0)

xgb
par(mfrow=c(2,2))
plot(xgb)
plot(varImp(xgb), top=20)
(x1c <- xgb |>
  predict(sal_test) |>
  confusionMatrix(sal_test$hi_comp))

evalm <- evalm %>% 
  add_row(
    as_tibble(rbind(c("Model"="XGBoost",x1c$overall["Accuracy"],x1c$byClass[c("Sensitivity","Specificity", "Balanced Accuracy")])))
  )
evalm %>% 
  mutate(
    across(
      c(Accuracy:`Balanced Accuracy`),
      ~ round(as.numeric(.x), 4)
    )
  ) %>% 
  arrange(desc(`Balanced Accuracy`)) %>% 
  knitr::kable()