Introduction

Our goal is to use job salary data (retrieved from the Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data to predict whether the salaries in the survey end up above or below per capita personal income for their state. We originally stated that we would get the state income data from the Census, but found that it only contained median household income. Therefore we are using personal income per capita for 2022, per FRED (Federal Reserve Economic Data: https://fred.stlouisfed.org/release/tables?eid=257197&rid=110).

Being a real-world survey dataset, the data itself is a bit messy, so this will take a decent amount of data transformation and cleaning. We also want to take two approaches for our predictions: a logistic model for predicting whether a person ends up above or below their state's per capita personal income, and a separate quantitative model for predicting income itself.

Loading Data

Here we load:

  • the Salary Survey data
  • the Personal Income Per Capita by State

Column Names

Before we do anything else, we want to rename the columns in the Salary Survey data. As is typical of survey data, the column names are the full text of the questions, and as such are long and unwieldy; working with the data in this form would be cumbersome.
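The renaming itself is one assignment, condensed here from the Code Appendix:

names(sal_df)<-c('timestamp','age','industry','job_title','title_context',
                 'annual_sal','add_comp','ccy','other_ccy','income_context',
                 'ctry','state','city','yrs_wrk_exp','yrs_in_field_exp',
                 'edu','gender','race')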

Description of Variables

Below is a table showing the variables in the Salary Survey data set:

Variable         Survey Question
timestamp Timestamp on the survey response
age How old are you?
industry What industry do you work in?
job_title Job title
title_context If your job title needs additional context, please clarify here
annual_sal What is your annual salary? (You’ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent – what you would earn if you worked the job 40 hours a week, 52 weeks a year.)
add_comp How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.
ccy Please indicate the currency
other_ccy If “Other,” please indicate the currency here
income_context If your income needs additional context, please provide it here
ctry What country do you work in?
state If you’re in the U.S., what state do you work in?
city What city do you work in?
yrs_wrk_exp How many years of professional work experience do you have overall?
yrs_in_field_exp How many years of professional work experience do you have in your field?
edu What is your highest level of education completed?
gender What is your gender?
race What is your race? (Choose all that apply.)

And a table showing the variables in the Personal Income Per Capita by State:

Variable         Description
state The name of the US State
PIncPerCap The personal income per capita for that state, for 2022, per FRED (Federal Reserve Economic Data)

Data Exploration

Let’s take a glimpse at the Salary Survey Data. There are 27,992 rows and 18 columns.

## Rows: 27,992
## Columns: 18
## $ timestamp        <chr> "4/27/2021 11:02:09", "4/27/2021 11:02:21", "4/27/202…
## $ age              <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34",…
## $ industry         <chr> "Education (Higher Education)", "Computing or Tech", …
## $ job_title        <chr> "Research and Instruction Librarian", "Change & Inter…
## $ title_context    <chr> "", "", "", "", "", "", "", "High school, FT", "Data …
## $ annual_sal       <int> 55000, 54600, 34000, 62000, 60000, 62000, 33000, 5000…
## $ add_comp         <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0…
## $ ccy              <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "USD…
## $ other_ccy        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ income_context   <chr> "", "", "", "", "", "", "", "", "", "I work for a Cha…
## $ ctry             <chr> "United States", "United Kingdom", "US", "USA", "US",…
## $ state            <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "South…
## $ city             <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "G…
## $ yrs_wrk_exp      <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 y…
## $ yrs_in_field_exp <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years",…
## $ edu              <chr> "Master's degree", "College degree", "College degree"…
## $ gender           <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "Ma…
## $ race             <chr> "White", "White", "White", "White", "White", "White",…

Data Preprocessing

Before beginning our exploration, we need to take some steps to prepare the data to be modeled. We will:

  • convert the empty strings to NA
  • make the category fields factors and order the values
  • limit to responses from the US
  • make sure state is in the form needed to merge with the personal income per capita from the state_inc data
  • merge the state_inc data
  • sum the annual_sal and the add_comp to get a total_comp
  • create a column that indicates if total_comp is above or below the state per capita income

Once these steps are complete, we can explore our data and do any data preparation that might be advantageous in our modeling.

Empty Strings to NA

Many columns have empty strings that should be NA.
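This is a one-line base R fix (from the Code Appendix):

sal_df[sal_df == ''] <- NA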

Factors and Order

These columns are category fields to be made into factors: age, yrs_wrk_exp, yrs_in_field_exp, edu, and gender.

## $age
## [1] "25-34"      "45-54"      "35-44"      "18-24"      "65 or over"
## [6] "55-64"      "under 18"  
## 
## $yrs_wrk_exp
## [1] "5-7 years"        "8 - 10 years"     "2 - 4 years"      "21 - 30 years"   
## [5] "11 - 20 years"    "1 year or less"   "41 years or more" "31 - 40 years"   
## 
## $yrs_in_field_exp
## [1] "5-7 years"        "2 - 4 years"      "21 - 30 years"    "11 - 20 years"   
## [5] "1 year or less"   "8 - 10 years"     "31 - 40 years"    "41 years or more"
## 
## $edu
## [1] "Master's degree"                    "College degree"                    
## [3] "PhD"                                "Some college"                      
## [5] "High School"                        "Professional degree (MD, JD, etc.)"
## 
## $gender
## [1] "Woman"                         "Non-binary"                   
## [3] "Man"                           "Other or prefer not to answer"
## [5] "Prefer not to answer"

Looking at these values, it is clear that all but gender can be ordered.
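A condensed sketch of the ordering step, using age as the example (the Code Appendix applies the same ordered() pattern to yrs_wrk_exp, yrs_in_field_exp and edu):

sal_df <- sal_df |>
  mutate(age = ordered(age, c("under 18", "18-24", "25-34", "35-44",
                              "45-54", "55-64", "65 or over")))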

NOTE TO TEAM: When we model, we may find that we do not want this in ‘natural’ order but in order of most to least common, or most to least likely to have salary > the state per capita income. TBD

Limiting to US only

We will remove any rows not in the US. Part of the messiness of this data is the variety of ways a single country is written: United States, US, USA, etc. We keep it simple and eliminate rows where state=NA. Based on the city field, 171 of the rows where state=NA appear to be from the US, and while in some cases the state might be derived from the city, in others it is not clear. Since eliminating the rows without a state leaves us with almost 23K rows, this seems the best solution.

After removing all the non-US rows, we have 22,995 rows and 17 columns. We removed the ctry column, as it no longer adds value.
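Both steps, condensed from the Code Appendix:

#keep only rows with a state, then drop the country column
sal_US <- sal_df[!is.na(sal_df$state),]
sal_US <- subset(sal_US, select = -c(ctry))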

State

Taking a look at state, there are 106 cases where multiple states are selected. This is small compared to the total number of rows, so we eliminate them and still have almost 23K rows (22,889).

## [1] "Mississippi, Missouri"                                                        
## [2] "California, Maryland"                                                         
## [3] "California, Illinois, Massachusetts, North Carolina, South Carolina, Virginia"
## [4] "Alabama, California"                                                          
## [5] "Michigan, Texas, Washington"                                                  
## [6] "Alabama, Oregon"
## [1] 106
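Because state is a multiselect, a multi-state response always contains a comma, which makes these rows easy to find and drop (from the Code Appendix):

#delete the rows with multiple states
sal_US <- sal_US[!grepl(',', sal_US$state),]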

Merge Data

Now that the state column is clean, we can merge the two data frames, bringing in the per capita income by state to each row.
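The merge is a left join on state (from the Code Appendix):

sal_US <- merge(sal_US, state_inc, by ='state', all.x=TRUE)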

##        state PIncPerCap
## 1    Alabama      50916
## 2     Alaska      68635
## 3    Arizona      58442
## 4   Arkansas      52618
## 5 California      77036
## 6   Colorado      75722

Target Column for high compensation

We want to create the target column indicating if the salary is above or below the Personal Income Per Capita. The steps to do this are:
- sum the annual_sal and add_comp fields to make a total_comp field
- create a column (hi_comp) that compares total_comp and PIncPerCap, set to 1 if total_comp is greater and 0 otherwise
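Condensed from the Code Appendix; note that rowSums() with na.rm=TRUE treats a missing add_comp as 0:

sal_US$total_comp <- rowSums(sal_US[, c('annual_sal','add_comp')], na.rm=TRUE)
sal_US <- sal_US |> mutate(hi_comp=if_else(total_comp>PIncPerCap,1,0))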

##   annual_sal add_comp total_comp PIncPerCap hi_comp
## 1     120000       NA     120000      50916       1
## 2      70000       NA      70000      50916       1
## 3     120000    15000     135000      50916       1
## 4     175000       NA     175000      50916       1
## 5      40000        0      40000      50916       0
## 6      22800       NA      22800      50916       0

Data Skimming

Now we have the data we want to model. Let’s skim for a quick view of summary statistics, and a visualization of distributions with mini histograms.

After all our pre-processing, we have 22,889 responses (rows) with 20 variables.

We notice here that there are many missing values:

  • 16,905 rows are missing additional context for the title (title_context)
  • 22,851 missing values for other currency (other_ccy)
  • 20,408 missing values for additional context for income (income_context)
  • 5,720 missing values for additional compensation (add_comp)
  • a small number of rows missing city (12), education level (edu, 138), gender (130), industry (56), and race (130)

We also see that the numeric data is mostly skewed right, except for PIncPerCap, which is rather normal.

Data summary
Name sal_US
Number of rows 22889
Number of columns 20
_______________________
Column type frequency:
character 10
factor 6
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 4 20 0 51 0
timestamp 0 1.00 16 19 0 20899 0
industry 56 1.00 2 171 0 1022 0
job_title 0 1.00 1 126 0 11916 0
title_context 16905 0.26 1 808 0 5771 10
ccy 0 1.00 3 7 0 8 0
other_ccy 22851 0.00 1 100 0 24 0
income_context 20408 0.11 1 939 0 2435 2
city 12 1.00 1 84 0 3710 6
race 130 0.99 5 125 0 45 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age 0 1.00 TRUE 7 25-: 10180, 35-: 8148, 45-: 2646, 18-: 960
yrs_wrk_exp 0 1.00 TRUE 8 11 : 7978, 8 -: 4371, 5-7: 3924, 21 : 3001
yrs_in_field_exp 0 1.00 TRUE 8 11 : 5401, 5-7: 5363, 2 -: 4944, 8 -: 4130
edu 138 0.99 TRUE 6 Col: 11148, Mas: 7363, Som: 1612, Pro: 1119
gender 130 0.99 FALSE 5 Wom: 17755, Man: 4187, Non: 590, Oth: 226
hi_comp 0 1.00 FALSE 2 1: 14389, 0: 8500

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
annual_sal 0 1.00 92025.95 75165.15 0 56160 78000 112000 5000044 ▇▁▁▁▁
add_comp 5720 0.75 12700.58 46465.79 0 0 2000 10000 1500000 ▇▁▁▁▁
PIncPerCap 0 1.00 69626.18 10301.27 46370 61475 68840 77036 95970 ▁▇▇▅▁
total_comp 0 1.00 101552.64 93505.36 0 58000 81000 120000 5004044 ▇▁▁▁▁

Missing Values

Let’s look at how we might address the missing values.

industry (56)

Industry looks like it was a free text field, and there are over 1,000 different values. Not super helpful. A bit of research reveals a grouping of industry sectors as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)

We will attempt to group all the industries into one of these groups in a new column, indust_grp.

A sample of the industry values:

## [1] "Computing or Tech"                       
## [2] "Federal Contracting/Business Development"
## [3] "Government and Public Administration"    
## [4] "Law"                                     
## [5] "STEM medical research"                   
## [6] "Education (Higher Education)"

Showing the first few job_title where NA industry:

## [1] "Adult Services Librarian"             
## [2] "Cybersecurity senior risk specialist "
## [3] "Senior Manager of Ops Strategy"       
## [4] "UX Designer"                          
## [5] "Lab Assistant (Level II)"             
## [6] "People Operations Manager "

title_context (16,905)

This field provides further explanation where the job_title requires it. Again, this is a free text field, and though most values are NA, there are still almost 6K responses, likely to be unique.

Our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the job title (1) or not (0). One thing to consider is that some responses seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case); a sketch of one way to do this follows the sample values below.

## [1] NA                                                                  
## [2] "Senior associate"                                                  
## [3] "Mid- level professional biotechnology laboratory researcher "      
## [4] "na"                                                                
## [5] "Make graphics and videos, write captions post them on social media"
## [6] "Commercial design (not residential)"

other_ccy (22,851)

As we eliminated the non-US country rows, we would expect ccy to be all USD, and therefore other_ccy to be all NAs. Instead, there are 13 rows where ccy is “Other” and other_ccy is populated. Looking at the values in ccy, we find the assumption that it would all be USD is wrong, as 26 rows have a currency other than USD.

## 
## AUD/NZD     CAD     EUR     GBP     JPY   Other     SEK     USD 
##       1       5       4       1       1      13       1   22863

Looking at what is in other_ccy, we find that there are a few USD or variations (US Dollar, American Dollars), a few that are another currency or equity/stock. But most seem to be answering a different question.

Since the comparison to the state personal income per capita requires that total_comp be in USD, we will ‘correct’ the 3 types of ‘others’ that are clearly USD, and then eliminate the remaining rows that are not USD (or not identifiable as such).

## [1] "Na "                                                 
## [2] "My bonus is based on performance up to 10% of salary"
## [3] "Equity"                                              
## [4] "N/a"                                                 
## [5] "Stock "                                              
## [6] "American Dollars"

income_context (20,408)

Similar to title_context, the purpose of this variable is to provide further explanation where the annual_sal, add_comp and ccy fields require it. Again, this is a free text field, and though most values are NA, there are still almost 2.5K responses, likely to be unique.

As with title_context, our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the income (1) or not (0). Here again, some responses seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case).

## [1] NA                                                                                                                         
## [2] "Need to Pay Fees"                                                                                                         
## [3] "na"                                                                                                                       
## [4] "24.2/hr"                                                                                                                  
## [5] "Clinical pharmacist in a small hospital"                                                                                  
## [6] "My bonuses are based on my fees in cases and it’s never a consistent amount. Ranges from $10,000-$50,000 per year so far."

city (12)

We notice that in addition to the 12 actual NAs, there are 32 rows with “na”s and another 92 with some variation of “prefer not to answer”.

We are not sure how city might affect the model. Some cities might have higher salaries, but again, this is free text, with over 2K unique responses.

Since there are almost as many unique responses as there are rows, and we cannot think of a way to group these, we will eliminate this column.

## [1] "Birmingham"  "Huntsville"  "Birmingham " "Auburn"      "Athens"     
## [6] "na"

Demographic variables: race, edu, gender

There are very similar counts of NAs for race, edu and gender (130, 138 and 130, respectively).

There are about 20 or so responses where all three of these are blank. It makes sense to remove those rows.

For the rest:

  • gender has both “Other or prefer not to answer” and “Prefer not to answer” as well as NAs. We will combine those into a single “Prefer not to answer”
  • similarly, we will replace the NAs in edu with “Prefer not to answer”
  • race will take a bit more work, as it is a check-all-that-apply field. So in addition to replacing the NAs with “Prefer not to answer”, we also have to split the field out into flags for each of the 7 options provided (one of which is “Another option not listed here or prefer not to answer” - so the NAs will go there)

Note that there are 6 options shown below, but there is also a 7th: ‘Hispanic, Latino, or Spanish origin’, which only appears along with other options selected.

## [1] "White"                                                 
## [2] "Asian or Asian American"                               
## [3] "Another option not listed here or prefer not to answer"
## [4] "Native American or Alaska Native"                      
## [5] "Black or African American"                             
## [6] "Middle Eastern or Northern African"                    
## [7] NA

We now have a plan for what data preparation is needed for missing values. Let’s explore other aspects of the data.

Variable Distribution

The histograms in skim are small and some detail is lost, so let’s take a closer look at some of the continuous variables.

In our first look, the visualization of the distributions is impacted by the very long tails.


The 75th percentile for annual_sal is less than $150,000, while the max is over $5,000,000. There are a very small number (fewer than 20) of rows where either annual_sal or add_comp is greater than 1 million. In order to see what the majority (first 75%) of the data looks like, we will filter annual_sal, add_comp and total_comp to values below $150,000.

When we do this, we can see that the large majority of the data is pretty normal. The right skewness of annual_sal, add_comp and total_comp comes from the small number of very high values.

PIncPerCap looks multimodal.

Target Flag Breakdown

We want to take a look into what contributes to having a total_comp greater than PIncPerCap. We look at the breakdown of whose total_comp does and does not exceed it, across the qualitative variables that we have.

This shows that the percent of respondents with hi_comp=1 (total_comp is greater than PIncPerCap):

  • generally increases with age
  • increases with both yrs_wrk_exp and yrs_in_field_exp until about 30 years then decreases.
  • increases with education levels
  • gender is a bit all over, but needs some cleanup to combine NA and the two variations of “Prefer not to answer”

Target Amount Breakdown

NOTE TO TEAM: I wanted to try this, but all it really shows is that if total_comp is higher, one is more likely to have hi_comp=1. Not really a revelation. Maybe we take this out?

Again here, we look at the bottom three quartiles of the data (total_comp < 150,000).

Maps

To understand how the Personal Income Per Capita varies by state, we can plot it on a map.

Because DC is so small on the map, it is hard to see that it has the highest per capita personal income, $95,970. But you can see that it is followed by Massachusetts (MA) at $84,561 and Connecticut (CT) at $82,938. Of course, the per capita calculation will be greatly impacted by population, especially in the highest population states (California, Texas, Florida and New York) and lowest population states (Wyoming, Vermont, DC, and Alaska).

We are also interested to see how many responses there are in each state. To do this, we create a small table with just the state and the count of rows with that state in the sal_US data.
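The count table is a one-liner with table() (from the Code Appendix):

occ<-as.data.frame(table(sal_US$state))
names(occ)<-c('state','responses')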

California (CA), with 2,607 responses, and New York (NY), with 2,168, are the top two. Wyoming (WY), with 24, and South Dakota (SD), with 25, are at the bottom. Every state (+ DC) is represented.

For our third map, we are interested in the mean of the total_comp reported by state. As we did above, we will create a small table to capture the data, then map it.

The top 3 are California (CA) at $135,362.24, Washington (WA) at $127,442.40 and New York (NY) at $117,769.82. At the other end of the spectrum are Wyoming (WY) at $60,475.96, Montana (MT) at $63,200.66 and Mississippi (MS) at $63,929.22.


Data Preparation

Now that we have taken a good look at our data, we can begin to prepare it to be useful in modeling, both to predict when total_comp is greater than personal income per capita and to predict the actual total_comp.

In order to be able to compare to our (pre-processed but still) baseline data, we will create a new data frame, sal_US1.

Missing Values

Some missing values will actually be addressed by creating new categories, so those will be discussed in the following section.

ccy

To make valid comparisons, we want all responses to be reported in USD. As described above, there are a few rows with ccy = ‘Other’ where the value in other_ccy makes it clear it is USD. We will update those and then delete the 23 non-USD rows. Once that is complete, we no longer need ccy or other_ccy, so we will remove those columns.
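Condensed from the Code Appendix, using %in% for brevity:

#correct the 'Other' rows that are really USD, keep USD only, drop the columns
sal_US1 <- sal_US1 |>
  mutate(ccy = ifelse(ccy=='Other' &
                      other_ccy %in% c('USD','US Dollar','American Dollars'),
                      'USD', ccy))
sal_US1 <- sal_US1[sal_US1$ccy=='USD',]
sal_US1 = subset(sal_US1, select = -c(ccy, other_ccy))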

city

Recall that city is free text, with almost as many unique values as rows, and no good way to categorize it. We do not need to deal with the NAs, as we will remove this column entirely.

New Categorical Columns

Industry Group

As described above, we will group industry into 5 categories as follows:

  • Raw Materials (eg Farming, Logging, Fishing, Forestry and Mining)
  • Manufacturing and Construction
  • Service (eg Doctors, Lawyers, Barbers, Mechanics, Banks, Insurance)
  • Information Services (eg IT, Media, R&D, Consulting, Education, Financial Planning)
  • Human Services (eg Government, Nonprofits and NGOs)

This will take some iterative effort to identify the group for the many varied values of the free text field industry.
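The grouping is one long case_when() of keyword matches, where order matters (an earlier match wins). A heavily abbreviated sketch; the full set of patterns is in the Code Appendix:

sal_US1 <- sal_US1 |> mutate(indust_grp = case_when(
    grepl('Agriculture', industry) ~ 'Raw Materials',
    grepl('Manufac', industry) ~ 'Manufacturing and Construction',
    grepl('Tech', industry) ~ 'Information Services',
    grepl('Gov', industry) ~ 'Human Services',
    grepl('Law', industry) ~ 'Service'
    #...many more grepl() patterns in the Code Appendix
  ))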

Once complete, we can see that Information Services has the largest number of responses, and very few were from the Raw Materials Industry Group.

## 
##                 Human Services           Information Services 
##                           4166                           9083 
## Manufacturing and Construction                  Raw Materials 
##                           1876                            169 
##                        Service 
##                           7523

There are 58 rows left with NA in industry, and therefore in indust_grp. We will remove those rows, and then remove industry, leaving just the indust_grp column.

We now have 22,817 rows and 17 variables.

Context fields

We will turn both context fields, title_context and income_context, into flags (0,1) indicating whether there is additional context. The values in these fields are too varied to be of any use in categorizing any other way, but it is possible that a response where it was necessary to add context might impact total_comp one way or another. We will then remove the original fields, leaving the same number of variables, 17.

As we can see, we have 5,974 rows with title_context_flg=1

## 
##     0     1 
## 16843  5974

And 2,475 rows with income_context_flg=1

## 
##     0     1 
## 20342  2475

Demographic Columns

There are a few rows where race, edu and gender are all NA. These will be removed, leaving 22,799 rows.

Next, we want to combine “Other or prefer not..” + “Prefer not..” + NAs in gender, leaving no NAs and a more organized set of values.

## [1] "Man"                  "Woman"                "Non-binary"          
## [4] "Prefer not to answer"

Similarly, we want to combine “Prefer not..” + NAs in edu, leaving no NAs and a more organized set of values.

## [1] "College degree"                     "Master's degree"                   
## [3] "Professional degree (MD, JD, etc.)" "High School"                       
## [5] "Some college"                       "PhD"                               
## [7] "Prefer not to answer"

On race, we combine “Another or prefer not..” + NAs. Since this is a pick-all-that-apply field, we then must make a set of flags to indicate whether each of the 7 choices was or was not selected. Once those fields are created, we remove the original multiselect race column.
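Condensed from the Code Appendix, showing two of the seven flags:

sal_US1 <- sal_US1 |>
  mutate(race=if_else(is.na(race),"Another or prefer not to answer",race)) |>
  mutate(White=if_else(grepl('White', race),1,0)) |>
  mutate(HispLatSp=if_else(grepl('Hisp', race),1,0))
#...five more flags, then drop the original column
sal_US1 = subset(sal_US1, select = -c(race))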

Since we were not able to look at these fields in the Target Flag Breakdown section of the Data Exploration, we will do so here.

add_comp

Much of add_comp is NA, as not everyone gets compensation beyond salary. We have captured the added amount in total_comp (annual_sal + add_comp), but it may be of interest whether add_comp is received at all. So, here again, we create a flag to indicate if add_comp was populated (1) or not (0), and then remove the original field.
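Condensed from the Code Appendix:

sal_US1 <- sal_US1 |>
  mutate(add_comp_flg=as_factor(if_else(!is.na(add_comp),1,0)))
sal_US1 = subset(sal_US1, select = -c(add_comp))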

As we can see, we have 17,110 rows with add_comp_flg=1

## 
##     0     1 
##  5689 17110

Timestamp and job_title

Timestamp is just the datetime of the response, and job_title is another free text field with values impossible to categorize. We will remove both.

#remove timestamp and job_title
sal_US1 = subset(sal_US1, select = -c(timestamp, job_title))

We now have 22,799 rows with 21 variables which we have either cleaned or created.

Outliers

To find outliers, we will use the scores() function from the outliers package, with type set to “iqr” and lim (the limit) set to 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.

We will create flags for each variable to identify potential outliers.
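From the Code Appendix; scores() returns TRUE for values outside the 1.5 x IQR fences:

sal_US$annual_salOUT<-scores(sal_US$annual_sal, type='iqr', lim=1.5)
sal_US$total_compOUT<-scores(sal_US$total_comp, type='iqr', lim=1.5)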

Using this method, we created:
- annual_salOUT: which shows we have 891 (4%) potential outliers in annual_sal
- total_compOUT: which shows 1,250 (5%) potential outliers in total_comp

The table below shows the resulting values (TRUE is a potential outlier).

##       annual_salOUT total_compOUT
## FALSE         21998         21639
## TRUE            891          1250

Correlation

Since most of the data is categorical, we can use a correlation funnel to look for correlation in our data.
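The funnel (from the Code Appendix): binarize() bins and one-hot encodes the data, correlate() computes correlations against the hi_comp = 1 level, and the result is plotted:

sal_US1 |>
  na.omit() |>
  binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) |>
  correlate(hi_comp__1) |>
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)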

Build Models


Select Models


Conclusion


Code Appendix

library(tidyverse)
library(skimr)      #skim
library(gdata)      #trim
library(ggpubr)     #ggarrange
library(scales)     #formatting numbers in ggplot axis 
library(cowplot)    #density chart
library(outliers)   #scores
library(qdapTools)  #mtabulate
library(ggcorrplot)   #corrplot for categorical data
library(correlationfunnel)
library(usmap)
#Loading Data
sal_df <- as.data.frame(read.csv('https://raw.githubusercontent.com/dianaplunkett/621/main/Ask%20A%20Manager%20Salary%20Survey%202021%20(Responses).csv'))
#load in the per capita income data.
state_inc<-as.data.frame(read.csv("https://raw.githubusercontent.com/dianaplunkett/621/main/PersonalIncomePerCapita.csv"))
names(sal_df)<-c('timestamp','age','industry','job_title', 'title_context','annual_sal','add_comp','ccy','other_ccy','income_context','ctry','state','city','yrs_wrk_exp','yrs_in_field_exp','edu','gender', 'race')

names(state_inc)<-c('state','PIncPerCap')
### Data Preprocessing 
glimpse(sal_df)
#empty strings to NA
sal_df[sal_df == ''] <- NA
#make factors
sal_df |>
  mutate(
    across(
      c(age, yrs_wrk_exp, yrs_in_field_exp, edu, gender),
      as_factor
    )
  )  -> sal_df

sal_df |>
  dplyr::select(where(is.factor)) |>
  sapply(levels)
#order factors
sal_df |>
  mutate(
    age = ordered(age, 
                  c("under 18", 
                    "18-24", 
                    "25-34", 
                    "35-44", 
                    "45-54", 
                    "55-64", 
                    "65 or over")),
    yrs_wrk_exp = ordered(yrs_wrk_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    yrs_in_field_exp = ordered(yrs_in_field_exp, 
                          c("1 year or less", 
                            "2 - 4 years", 
                            "5-7 years", 
                            "8 - 10 years", 
                            "11 - 20 years",
                            "21 - 30 years",
                            "31 - 40 years",
                            "41 years or more")),
    edu = ordered(edu, 
                    c("High School", 
                      "Some college", 
                      "College degree", 
                      "Master's degree", 
                      "PhD", 
                      "Professional degree (MD, JD, etc.)")),
  ) -> sal_df
#create a new df sal_US with only rows where state != NA
sal_US <- sal_df[!is.na(sal_df$state),]
#remove the ctry column, all rows are now US
sal_US = subset(sal_US, select = -c(ctry) )
#unique state shows 132 different values, just showing tail for clarity. 
tail(unique(sal_US$state))
#how many rows have a ',', eg multiple states
length(which(grepl(',', sal_US$state)))
#delete the rows with multiple states
sal_US <- sal_US[!grepl(',', sal_US$state),]
#merge rows 
sal_US <- merge(sal_US, state_inc, by ='state', all.x=TRUE)

head(sal_US |> distinct(state,PIncPerCap))
#sum the annual_sal and add_comp
sal_US$total_comp <- rowSums(sal_US[, c('annual_sal','add_comp')], na.rm=TRUE)

#create the hi_comp field
sal_US <- sal_US |> mutate(hi_comp=if_else(total_comp>PIncPerCap,1,0))
sal_US$hi_comp <- as_factor(sal_US$hi_comp)

head(sal_US |> select (annual_sal, add_comp,total_comp,PIncPerCap, hi_comp))
#Data Skimming
skim(sal_US)
head(unique(sal_US$industry))
head(sal_US$job_title[is.na(sal_US$industry)])
head(unique(sal_US$title_context))
table(sal_US$ccy)
tail (unique(sal_US$other_ccy))
head(unique(sal_US$income_context))
head(unique(sal_US$city))
#### Demographic variables
unique(sal_US[(!grepl(',', sal_US$race)), c('race')])
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
pa <- sal_US |> 
    ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    theme_classic()
pb <- sal_US |> 
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    theme_classic()
pc <- sal_US |> 
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
    theme_classic()
pd <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(pa,pb,pc,pd)
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
p1 <- sal_US |> 
    filter(annual_sal < 150000) |>
  ggplot (aes(x=annual_sal))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Annual Salary") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    theme_classic()
p2 <- sal_US |> 
    filter(add_comp < 150000) |>
  ggplot (aes(x=add_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Additional Compensation") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    theme_classic()
p3 <- sal_US |> 
    filter(total_comp < 150000) |>
  ggplot (aes(x=total_comp))+
    geom_histogram(bins=50, fill ="#6aaec8")+
    labs(x = "Total Compensation") +
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
    theme_classic()
p4 <- sal_US |> 
  ggplot (aes(x=PIncPerCap))+
    geom_histogram(bins=20, fill ="#6aaec8")+
    scale_x_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))+
    labs(x = "Personal Income Per Capita") +
    theme_classic()

ggarrange(p1,p2,p3,p4)
theColors <- c('#6aaec8','#52985e')

p5<-ggplot(sal_US, aes(fill=hi_comp, x=age)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Age")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p6<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_wrk_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years Work Exp")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p7<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_in_field_exp)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Years In Field")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p8<-ggplot(sal_US, aes(fill=hi_comp, x=edu)) + 
    geom_bar(position="fill", stat="count", show.legend = FALSE)+
    ggtitle("Education")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p9<-ggplot(sal_US, aes(fill=hi_comp, x=gender)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Gender")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 10), axis.title.y=element_blank(), legend.title=element_text(size=8))

ggarrange(p5,p6,p7,p8,p9, nrow=3, ncol = 2)
# Target Amount Breakdown
# density plots
multdens <- function(df, var, target) {
  
  df |> 
    dplyr::select(annual_sal, add_comp, total_comp, hi_comp) |> 
    filter(!is.na(.data[[var]])) |> filter(total_comp < 150000) |>
    ggplot(aes(.data[[var]], fill = .data[[target]], color = .data[[target]])) +
      geom_density(alpha = 0.4) +
    scale_fill_manual(values=theColors)+
    scale_color_manual(values=theColors)+
      labs(y = NULL) +
      theme_minimal(base_size = 6) +
      theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}

num_col <- sal_US |>
  dplyr::select(annual_sal, add_comp, total_comp) |>
  colnames()

dens <- lapply(num_col, multdens, df = sal_US, target = "hi_comp")

plot_grid(plotlist = dens, label_x = NULL, ncol=1)
# map of per capita personal income 
map<-plot_usmap(data= state_inc, values ='PIncPerCap', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Personal Income Per Capita', label = scales::comma) + 
  theme(legend.position = "right")

# Set label font size
map$layers[[2]]$aes_params$size <- 1.5
print(map)
#first we want to gather the data to map
occ<-as.data.frame(table(sal_US$state))
names(occ)<-c('state','responses')
#map of responses by state
map_2<-plot_usmap(data= occ, values ='responses', color = '#52985e', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#03688f',name = 'Responses', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_2$layers[[2]]$aes_params$size <- 1.5
print(map_2)
#first we want to gather the data to map
avg<-sal_US |>
  group_by(state) |>
  summarise_at(vars(total_comp), list(name = mean))
names(avg)<-c('state','avg_comp')
#map of responses by state
map_3<-plot_usmap(data= avg, values ='avg_comp', color = '#03688f', labels=TRUE, label_color = '#36454F')+
  scale_fill_continuous(low = 'white', high = '#52985e',name = 'Mean Total Comp', label = scales::comma) + 
  theme(legend.position = 'right')

# Set label font size
map_3$layers[[2]]$aes_params$size <- 1.5
print(map_3)
# Data Preparation
# new data frame for data prep 
sal_US1 <- sal_US
#correct the others that are really USD
sal_US1 <- sal_US1 |>
  mutate (ccy = ifelse(ccy=='Other' & (other_ccy=='USD' | other_ccy == "US Dollar" | other_ccy == "American Dollars"),'USD', ccy))

#Keep only the rows where ccy=USD
sal_US1 <- sal_US1[sal_US1$ccy =='USD',]

table(sal_US1$ccy)
#remove ccy and other_ccy columns as now all USD
sal_US1 = subset(sal_US1, select = -c(ccy, other_ccy))
#remove city column
sal_US1 = subset(sal_US1, select = -c(city))
#make indust_grp field
sal_US1 <- sal_US1 %>% mutate (indust_grp = 
            case_when (
#Raw Materials
              grepl('Forest', industry) ~ 'Raw Materials',
              grepl('Mining', industry) ~ 'Raw Materials',
              grepl('Oil', industry) ~ 'Raw Materials',
              grepl('oil', industry) ~ 'Raw Materials',
              grepl('Energy', industry) ~ 'Raw Materials',
              grepl('energy', industry) ~ 'Raw Materials',
              grepl('Petroleum', industry) ~ 'Raw Materials',
              grepl('Agriculture', industry) ~ 'Raw Materials',
              grepl('Fisher', industry) ~ 'Raw Materials',
              
#Manufacturing and Construction             
              grepl('Construction', industry) ~ 'Manufacturing and Construction',
              grepl('Contracting', industry) ~ 'Manufacturing and Construction',
              grepl('Manufac', industry) ~ 'Manufacturing and Construction',
              grepl('MANUFAC', industry) ~ 'Manufacturing and Construction',
              grepl('manufac', industry) ~ 'Manufacturing and Construction',
              grepl('Mfg', industry) ~ 'Manufacturing and Construction',
              grepl('coat', industry) ~ 'Manufacturing and Construction',
              grepl('Wine', industry) ~ 'Manufacturing and Construction',              
              grepl('Packaged Goods', industry) ~ 'Manufacturing and Construction',
            grepl('packaged goods', industry) ~ 'Manufacturing and Construction',
            grepl('Goods', industry) ~ 'Manufacturing and Construction',
            grepl('Consumer', industry) ~ 'Manufacturing and Construction',
            grepl('consumer', industry) ~ 'Manufacturing and Construction',
            grepl('Commercial furniture', industry) ~ 'Manufacturing and Construction',
            grepl('Industrial', industry) ~ 'Manufacturing and Construction',
            grepl('Craft Beer Industry', industry) ~ 'Manufacturing and Construction',
            grepl('Beverage', industry) ~ 'Manufacturing and Construction',
            grepl('Brew', industry) ~ 'Manufacturing and Construction',
            grepl('Chemicals', industry) ~ 'Manufacturing and Construction',
            grepl('Automotive', industry) ~ 'Manufacturing and Construction',
#Information Services             
              grepl('Educ', industry) ~ 'Information Services',
              grepl('educ', industry) ~ 'Information Services',
              grepl('acad', industry) ~ 'Information Services',
              grepl('learn', industry) ~ 'Information Services',
              grepl('School', industry) ~ 'Information Services',
              grepl('Acad', industry) ~ 'Information Services',
              grepl('College', industry) ~ 'Information Services',
              grepl('Tech', industry) ~ 'Information Services',
              grepl('tech', industry) ~ 'Information Services',
              grepl('Anal', industry) ~ 'Information Services',
              grepl('development', industry) ~ 'Information Services',

              grepl('Title IX', industry) ~ 'Information Services',
              grepl('Publication', industry) ~ 'Information Services',              
              grepl('edit', industry) ~ 'Information Services',
              grepl('Publish', industry) ~ 'Information Services',
              grepl('publish', industry) ~ 'Information Services',
              grepl('Ecology', industry) ~ 'Information Services',
              grepl('R&D', industry) ~ 'Information Services',
              grepl('Info', industry) ~ 'Information Services',
              grepl('Librar', industry) ~ 'Information Services',
              grepl('LIbrar', industry) ~ 'Information Services',
              grepl('librar', industry) ~ 'Information Services',
              grepl('LIBRAR', industry) ~ 'Information Services',

              grepl('Archivist', industry) ~ 'Information Services',
              grepl('Reporter', industry) ~ 'Information Services',
              grepl('Engineer', industry) ~ 'Information Services',
              grepl('engineer', industry) ~ 'Information Services',
              grepl('Copywriter', industry) ~ 'Information Services',
              grepl('Develop', industry) ~ 'Information Services',
              grepl('Edit', industry) ~ 'Information Services',
              grepl('Design', industry) ~ 'Information Services',
              grepl('Teach', industry) ~ 'Information Services',
              grepl('Train', industry) ~ 'Information Services',
              grepl('Instruct', industry) ~ 'Information Services',
              grepl('Research', industry) ~ 'Information Services',
              grepl('research', industry) ~ 'Information Services',
              grepl('Media', industry) ~ 'Information Services',
              grepl('STEM', industry) ~ 'Information Services',
              grepl('Think tank', industry) ~ 'Information Services',
              grepl('Marketing', industry) ~ 'Information Services',
              grepl('marketing', industry) ~ 'Information Services',
              grepl('Communica', industry) ~ 'Information Services',
              grepl('market research', industry) ~ 'Information Services',
              grepl('Data', industry) ~ 'Information Services',
              grepl('data', industry) ~ 'Information Services',
              grepl('content', industry) ~ 'Information Services',
              grepl('Content', industry) ~ 'Information Services',
              grepl('CPG', industry) ~ 'Information Services',
              grepl('IT', industry) ~ 'Information Services',
              grepl('Scien', industry) ~ 'Information Services',
              grepl('science', industry) ~ 'Information Services',
              grepl('Video', industry) ~ 'Information Services',
              grepl('Internet', industry) ~ 'Information Services',
            grepl('Archaeology', industry) ~ 'Information Services',
              grepl('Bio', industry) ~ 'Information Services',
              grepl('chemistry', industry) ~ 'Information Services',
              grepl('student', industry) ~ 'Information Services',
              grepl('Student', industry) ~ 'Information Services',
              grepl('Graduate assistant', industry) ~ 'Information Services',
              grepl('It security', industry) ~ 'Information Services',
              grepl('lab', industry) ~ 'Information Services',
              grepl('Print', industry) ~ 'Information Services',
            grepl('Chemistry', industry) ~ 'Information Services',
            grepl('Gaming', industry) ~ 'Information Services',
            grepl('Software', industry) ~ 'Information Services',
            grepl('Journalism', industry) ~ 'Information Services',
            grepl('journalism', industry) ~ 'Information Services',
            grepl('Work-Study', industry) ~ 'Information Services',
            grepl('Geospatial', industry) ~ 'Information Services',
            grepl('Virtual reality', industry) ~ 'Information Services',
            grepl('Survey', industry) ~ 'Information Services',
            
            grepl('PR', industry) ~ 'Information Services',
            grepl('Quality Assurance', industry) ~ 'Information Services',
#Human Services              
              grepl('Gov', industry) ~ 'Human Services',
              grepl('gov', industry) ~ 'Human Services',
              grepl('Volun', industry) ~ 'Human Services',
              grepl('Policy', industry) ~ 'Human Services',
              grepl('City', industry) ~ 'Human Services',
              grepl('Federal', industry) ~ 'Human Services',
              grepl('federal', industry) ~ 'Human Services',
              grepl('State', industry) ~ 'Human Services',
              grepl('Grant', industry) ~ 'Human Services',

              grepl('Nonprofit', industry) ~ 'Human Services',
              grepl('Non-profit', industry) ~ 'Human Services',
              grepl('Workforce', industry) ~ 'Human Services',
              grepl('Social Work', industry) ~ 'Human Services',
              grepl('Fundrais', industry) ~ 'Human Services',
              grepl('Environ', industry) ~ 'Human Services',
              grepl('environ', industry) ~ 'Human Services',
              grepl('Natural', industry) ~ 'Human Services',
              grepl('not for profit', industry) ~ 'Human Services',
              grepl('Non Profit', industry) ~ 'Human Services',
              grepl('Non profit', industry) ~ 'Human Services',
              grepl('Military', industry) ~ 'Human Services',
              grepl('Ministry', industry) ~ 'Human Services',
              grepl('Clergy', industry) ~ 'Human Services',
              grepl('Chaplain', industry) ~ 'Human Services',
              grepl('Foundation', industry) ~ 'Human Services',
              grepl('Religio', industry) ~ 'Human Services',
              grepl('Heritage', industry) ~ 'Human Services',
              grepl('AmeriCorps', industry) ~ 'Human Services',
              grepl('Intelligence', industry) ~ 'Human Services',
              grepl('Museum', industry) ~ 'Human Services',
              grepl('Zoo', industry) ~ 'Human Services',
              grepl('Philanthropy', industry) ~ 'Human Services',
              grepl('International development', industry) ~ 'Human Services',
              grepl('Historic Preservation', industry) ~ 'Human Services',
              grepl('Maritime', industry) ~ 'Human Services',
              grepl('Church', industry) ~ 'Human Services',
              grepl('Association', industry) ~ 'Human Services',
              grepl('association', industry) ~ 'Human Services',
              grepl('Politic', industry) ~ 'Human Services',
              grepl('politic', industry) ~ 'Human Services',
              grepl('Defense', industry) ~ 'Human Services',
              grepl('union', industry) ~ 'Human Services',
              grepl('Union', industry) ~ 'Human Services',
              grepl('Labor', industry) ~ 'Human Services',

              grepl('Funding', industry) ~ 'Human Services',
              grepl('Performing Arts', industry) ~ 'Human Services',
              grepl('Hosp', industry) ~ 'Human Services',
              grepl('parent', industry) ~ 'Human Services',

#Service              
              grepl('Law', industry) ~ 'Service',
              grepl('Legal', industry) ~ 'Service',
              grepl('attorney', industry) ~ 'Service',
              grepl('Psychol', industry) ~ 'Service',
              grepl('Counseling', industry) ~ 'Service',
              grepl('Doc', industry) ~ 'Service',
              grepl('Forensics', industry) ~ 'Service',
              grepl('Toxicology', industry) ~ 'Service',
              grepl('Medical', industry) ~ 'Service',
              grepl('Manage', industry) ~ 'Service',
              grepl('manage', industry) ~ 'Service',
              grepl('Admin', industry) ~ 'Service',
              grepl('Reception', industry) ~ 'Service',
              grepl('design', industry) ~ 'Service',
              grepl('Sale', industry) ~ 'Service',
              grepl('sale', industry) ~ 'Service',
              grepl('Buyer', industry) ~ 'Service',
              grepl('Bank', industry) ~ 'Service',
              grepl('finance', industry) ~ 'Service',
              grepl('trading', industry) ~ 'Service',
              grepl('Archi', industry) ~ 'Service',
              grepl('archi', industry) ~ 'Service',
              grepl('product development', industry) ~ 'Service',
              grepl('Equipment', industry) ~ 'Service',
              grepl('Health', industry) ~ 'Service',
              grepl('health', industry) ~ 'Service',
              grepl('Retail', industry) ~ 'Service',
              grepl('Tour', industry) ~ 'Service',
              grepl('Screening', industry) ~ 'Service',
              grepl('Insur', industry) ~ 'Service',
              grepl('Hospitality', industry) ~ 'Service',
              grepl('Consult', industry) ~ 'Service',
              grepl('consult', industry) ~ 'Service',
              grepl('Rideshare', industry) ~ 'Service',
              grepl('Transp', industry) ~ 'Service',
              grepl('Logist', industry) ~ 'Service',
              grepl('Distribution', industry) ~ 'Service',
              grepl('Deliver', industry) ~ 'Service',
              grepl('deliver', industry) ~ 'Service',
              grepl('sustainability', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('Planning', industry) ~ 'Service',      
              grepl('Recruit', industry) ~ 'Service',
              grepl('HR', industry) ~ 'Service',
              grepl('Equity', industry) ~ 'Service',
              grepl('Human', industry) ~ 'Service',
              grepl('Childcare', industry) ~ 'Service',
              grepl('Child care', industry) ~ 'Service',
              grepl('Child Care', industry) ~ 'Service',
              grepl('Telecom', industry) ~ 'Service',
              grepl('parking', industry) ~ 'Service',
              grepl('restoration', industry) ~ 'Service',
              grepl('Homemaker', industry) ~ 'Service',
              grepl('Caregiver', industry) ~ 'Service',

              grepl('Landscap', industry) ~ 'Service',
              grepl('landscap', industry) ~ 'Service',
              grepl('Repair', industry) ~ 'Service',
              grepl('repair', industry) ~ 'Service',
              grepl('Facilities', industry) ~ 'Service',

              grepl('engraving', industry) ~ 'Service',
              grepl('Call center', industry) ~ 'Service',
              grepl('Wholesale', industry) ~ 'Service',
              grepl('wholesale', industry) ~ 'Service',
              grepl('CALL', industry) ~ 'Service',
              grepl('Pharma', industry) ~ 'Service',
              grepl('pharma', industry) ~ 'Service',
              grepl('Regulator', industry) ~ 'Service',
              grepl('Entertain', industry) ~ 'Service',
              grepl('Sport', industry) ~ 'Service',
              grepl('Music', industry) ~ 'Service',
              grepl('music', industry) ~ 'Service',
              grepl('I work for Indeed.com', industry) ~ 'Service',
              grepl('Interpret', industry) ~ 'Service',
              grepl('Funeral', industry) ~ 'Service',
              grepl('funeral', industry) ~ 'Service',
              grepl('Translation', industry) ~ 'Service',
              grepl('Concrete', industry) ~ 'Service',
              grepl('ecomm', industry) ~ 'Service',
              grepl('E-comm', industry) ~ 'Service',
              grepl('E-Comm', industry) ~ 'Service',
              grepl('e-comm', industry) ~ 'Service',
              grepl('E comm', industry) ~ 'Service',
              grepl('EComm', industry) ~ 'Service',
              grepl('Ecomm', industry) ~ 'Service',
              grepl('Pest', industry) ~ 'Service',
              grepl('Service', industry) ~ 'Service',
              grepl('service', industry) ~ 'Service',
              grepl('Cleaning', industry) ~ 'Service',
              grepl('Saas', industry) ~ 'Service',
              grepl('compli', industry) ~ 'Service',
              grepl('Assisting', industry) ~ 'Service',

              grepl('Apparel', industry) ~ 'Service',
              grepl('appraisal', industry) ~ 'Service',
              grepl('Compli', industry) ~ 'Service',
              grepl('Business', industry) ~ 'Service',
              grepl('Strategy', industry) ~ 'Service',
              grepl('Eap', industry) ~ 'Service',
              grepl('Investing', industry) ~ 'Service',
              grepl('Storage', industry) ~ 'Service',
              grepl('Supply', industry) ~ 'Service',
              grepl('Procur', industry) ~ 'Service',
              grepl('Airline', industry) ~ 'Service',
              grepl('Aerospace', industry) ~ 'Service',
              grepl('aviat', industry) ~ 'Service',
              grepl('Aviat', industry) ~ 'Service',
              grepl('Importing', industry) ~ 'Service',
              grepl('Staffing', industry) ~ 'Service',
              grepl('Cannabis', industry) ~ 'Service',
              grepl('warehous', industry) ~ 'Service',
              grepl('Warehous', industry) ~ 'Service',
              grepl('Rental', industry) ~ 'Service',
              grepl('Food', industry) ~ 'Service',
              grepl('food', industry) ~ 'Service',
              grepl('cook', industry) ~ 'Service',
              grepl('Beauty', industry) ~ 'Service',
              grepl('safety', industry) ~ 'Service',
              grepl('protect', industry) ~ 'Service',
  
              grepl('Operations', industry) ~ 'Service',
              grepl('Real Estate', industry) ~ 'Service',
              grepl('Real estate', industry) ~ 'Service',
              grepl('real estate', industry) ~ 'Service',
              grepl('Fitness', industry) ~ 'Service',
              grepl('Gym', industry) ~ 'Service',
              grepl('accessibility', industry) ~ 'Service',
              grepl('Accessibility', industry) ~ 'Service',
              grepl('Executive', industry) ~ 'Service',
              grepl('fitness', industry) ~ 'Service',
              grepl('Security', industry) ~ 'Service',
              grepl('Per Sit', industry) ~ 'Service',
              grepl('Pet', industry) ~ 'Service',
              grepl('Vet', industry) ~ 'Service',
              grepl('Animal', industry) ~ 'Service',
              grepl('restaurant', industry) ~ 'Service',
              grepl('Restaurant', industry) ~ 'Service',
              grepl('Auction', industry) ~ 'Service',
              grepl('Mortgage', industry) ~ 'Service',
              grepl('mortgage', industry) ~ 'Service',
              grepl('Tax', industry) ~ 'Service',
              grepl('Film', industry) ~ 'Service',

              grepl('Travel', industry) ~ 'Service',
              grepl('Gambl', industry) ~ 'Service',
              grepl('office', industry) ~ 'Service',
              grepl('Office', industry) ~ 'Service',
              grepl('recycling', industry) ~ 'Service',
              grepl('Fashion', industry) ~ 'Service',
              grepl('Finance', industry) ~ 'Service',
              grepl('coach', industry) ~ 'Service',
              grepl('Corporate', industry) ~ 'Service',
              grepl('camp', industry) ~ 'Service',            
              grepl('Cosmetology', industry) ~ 'Service',
              grepl('Purchasing', industry) ~ 'Service',
              grepl('Printing', industry) ~ 'Service',
              grepl('Global Mobility', industry) ~ 'Service',
              grepl('professional', industry) ~ 'Service',
              grepl('Horticulture', industry) ~ 'Service',
              grepl('Maintenance', industry) ~ 'Service',
              grepl('Janitorial', industry) ~ 'Service'
    ))
table(sal_US1$indust_grp)

sal_US1 <- sal_US1[!is.na(sal_US1$indust_grp),]
sal_US1 = subset(sal_US1, select = -c(industry))
#create the title_context_flg field
sal_US1 <- sal_US1 |>
  mutate(title_context_flg=if_else(!is.na(title_context),1,0))
sal_US1$title_context_flg <- as_factor(sal_US1$title_context_flg)
#create the income_context_flg field
sal_US1 <- sal_US1 |>
  mutate(income_context_flg=if_else(!is.na(income_context),1,0))
sal_US1$income_context_flg <- as_factor(sal_US1$income_context_flg)

sal_US1 = subset(sal_US1, select = -c(income_context, title_context))
table(sal_US1$title_context_flg)
table(sal_US1$income_context_flg)
sal_US1<-sal_US1[rowSums(is.na(sal_US1[, c("race", "edu", "gender")])) < 3,]
sal_US1 <- sal_US1 |>
  mutate(gender=if_else(grepl('refer', gender) | is.na(gender),"Prefer not to answer",gender))

unique(sal_US1$gender)
sal_US1 <- sal_US1 |>
  mutate(edu=if_else(is.na(edu),"Prefer not to answer",edu))

unique(sal_US1$edu)
sal_US1 <- sal_US1 |>
  mutate(race=if_else(is.na(race),"Another or prefer not to answer",race))

sal_US1 <- sal_US1 |>
  mutate(White=if_else(grepl('White', race),1,0)) |>
  mutate(AsianOrAsianAm=if_else(grepl('Asian', race),1,0)) |>
  mutate(NatAmOrAlaskaNat=if_else(grepl('Native', race),1,0)) |>
  mutate(BlackOrAfricanAm=if_else(grepl('Black', race),1,0)) |>
  mutate(MiddleEorNAfrican=if_else(grepl('Middle', race),1,0))|>
  mutate(PreferNot=if_else(grepl('prefer', race),1,0)) |>
  mutate(HispLatSp=if_else(grepl('Hisp', race),1,0))

sal_US1 = subset(sal_US1, select = -c(race))
p10<-ggplot(sal_US1, aes(fill=hi_comp, x=White)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("White")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p11<-ggplot(sal_US1, aes(fill=hi_comp, x=AsianOrAsianAm)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Asian or Asian American")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p12<-ggplot(sal_US1, aes(fill=hi_comp, x=NatAmOrAlaskaNat)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Native American or Alaska Native")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p13<-ggplot(sal_US1, aes(fill=hi_comp, x=BlackOrAfricanAm)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Black or African American")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p14<-ggplot(sal_US1, aes(fill=hi_comp, x=MiddleEorNAfrican)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Middle Eastern or Northern African")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p15<-ggplot(sal_US1, aes(fill=hi_comp, x=HispLatSp)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Hispanic, Latino, or Spanish origin")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p16<-ggplot(sal_US1, aes(fill=hi_comp, x=PreferNot)) + 
    geom_bar(position="fill", stat="count")+
    ggtitle("Another or prefer not to answer")+ coord_flip()+
    scale_fill_manual(values=theColors)+
    theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
ggarrange(p10,p11,p12,p13,p14,p15,p16, nrow=3, ncol = 3)
#create the add_comp_flg field
sal_US1 <- sal_US1 |>
  mutate(add_comp_flg=if_else(!is.na(add_comp),1,0))
sal_US1$add_comp_flg <- as_factor(sal_US1$add_comp_flg)
#remove add_comp
sal_US1 = subset(sal_US1, select = -c(add_comp))
table(sal_US1$add_comp_flg)
#remove timestamp and job_title
sal_US1 = subset(sal_US1, select = -c(timestamp, job_title))
#Outliers
#create new variable to flag potential outliers
sal_US$annual_salOUT<-scores(sal_US$annual_sal, type='iqr', lim=1.5)

sal_US$total_compOUT<-scores(sal_US$total_comp, type='iqr', lim=1.5)

#using mtabulate from the qdapTools package to show a table with the values of all the OUT variables
t(mtabulate(sal_US[grep('OUT',names(sal_US))]))
# Correlation Funnel
sal_US1 %>%
  na.omit() |>
  binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) |> 
  correlate(hi_comp__1) |> 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)