Our goal is to utilize job salary data (retrieved from Ask A Manager Salary Survey here: https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792) and demographic data to predict if the salaries in the survey end up above or below per capita personal income for state. We originally stated that we would get the state income data from the Census, but found that it only contained median household income. Therefore we are using personal income per capita for 2022, per FRED (Federal Reserve Economic Data: https://fred.stlouisfed.org/release/tables?eid=257197&rid=110).
The data itself is a bit messy, as real-world survey data tends to be, so it will take a decent amount of transformation and cleaning. We also want to take two approaches in our predictions: a logistic-style model to predict whether a person ends up above or below the state per capita personal income, and a separate quantitative model to predict income itself.
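As a preview, a minimal sketch of the two approaches might look like the following. The predictors shown are placeholders, and sal_US1, hi_comp, total_comp and indust_grp are the prepared data frame and variables created later in this report; the actual model specifications are TBD.

#minimal sketch only - placeholder predictors, final specifications TBD
#logistic model for the binary target: is total_comp above the state per capita income?
logit_mod <- glm(hi_comp ~ age + edu + yrs_wrk_exp + indust_grp,
                 data = sal_US1, family = binomial)
#linear model for the quantitative target: total_comp itself
lm_mod <- lm(total_comp ~ age + edu + yrs_wrk_exp + indust_grp,
             data = sal_US1)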
Here we load the required libraries, the Salary Survey data, and the Personal Income Per Capita data.
Before we do anything else, we want to rename the columns in the Salary Survey Data. As is typical of survey data, the column names are the full text of the questions, and as such are long and unwieldy. It would be cumbersome to do anything in this form.
Below is a table showing the variables in the Salary Survey data set:
| Variable | Survey Question |
|---|---|
| timestamp | Timestamp on the survey response |
| age | How old are you? |
| industry | What industry do you work in? |
| job_title | Job title |
| title_context | If your job title needs additional context, please clarify here |
| annual_sal | What is your annual salary? (You’ll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent – what you would earn if you worked the job 40 hours a week, 52 weeks a year.) |
| add_comp | How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits. |
| ccy | Please indicate the currency |
| other_ccy | If “Other,” please indicate the currency here |
| income_context | If your income needs additional context, please provide it here |
| ctry | What country do you work in? |
| state | If you’re in the U.S., what state do you work in? |
| city | What city do you work in? |
| yrs_wrk_exp | How many years of professional work experience do you have overall? |
| yrs_in_field_exp | How many years of professional work experience do you have in your field? |
| edu | What is your highest level of education completed? |
| gender | What is your gender? |
| race | What is your race? (Choose all that apply.) |
And a table showing the variables in the Personal Income Per Capita by State:
| Variable | Description |
|---|---|
| state | The name of the US State |
| PIncPerCap | The personal income per capita for that state, for 2022, per FRED (Federal Reserve Economic Data) |
Let’s take a glimpse at the Salary Survey Data. There are 27,992 rows and 18 columns.
## Rows: 27,992
## Columns: 18
## $ timestamp <chr> "4/27/2021 11:02:09", "4/27/2021 11:02:21", "4/27/202…
## $ age <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34",…
## $ industry <chr> "Education (Higher Education)", "Computing or Tech", …
## $ job_title <chr> "Research and Instruction Librarian", "Change & Inter…
## $ title_context <chr> "", "", "", "", "", "", "", "High school, FT", "Data …
## $ annual_sal <int> 55000, 54600, 34000, 62000, 60000, 62000, 33000, 5000…
## $ add_comp <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0…
## $ ccy <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "USD…
## $ other_ccy <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ income_context <chr> "", "", "", "", "", "", "", "", "", "I work for a Cha…
## $ ctry <chr> "United States", "United Kingdom", "US", "USA", "US",…
## $ state <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "South…
## $ city <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "G…
## $ yrs_wrk_exp <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 y…
## $ yrs_in_field_exp <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years",…
## $ edu <chr> "Master's degree", "College degree", "College degree"…
## $ gender <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "Ma…
## $ race <chr> "White", "White", "White", "White", "White", "White",…
Before beginning our exploration, we need to take some steps to create the data to be modeled. We will:

- convert empty strings to NA
- turn the categorical fields (age, yrs_wrk_exp, yrs_in_field_exp, edu and gender) into factors, ordered where appropriate
- remove rows that are not from the US, and rows where multiple states were selected
- merge in the Personal Income Per Capita by state
- create the total_comp field and the hi_comp target flag
Once these steps are complete, we can explore our data and do any data preparation that might be advantageous in our modeling.
Many columns have empty strings that should be NA.
These columns are category fields to be made into factors: age, yrs_wrk_exp, yrs_in_field_exp, edu and gender.
## $age
## [1] "25-34" "45-54" "35-44" "18-24" "65 or over"
## [6] "55-64" "under 18"
##
## $yrs_wrk_exp
## [1] "5-7 years" "8 - 10 years" "2 - 4 years" "21 - 30 years"
## [5] "11 - 20 years" "1 year or less" "41 years or more" "31 - 40 years"
##
## $yrs_in_field_exp
## [1] "5-7 years" "2 - 4 years" "21 - 30 years" "11 - 20 years"
## [5] "1 year or less" "8 - 10 years" "31 - 40 years" "41 years or more"
##
## $edu
## [1] "Master's degree" "College degree"
## [3] "PhD" "Some college"
## [5] "High School" "Professional degree (MD, JD, etc.)"
##
## $gender
## [1] "Woman" "Non-binary"
## [3] "Man" "Other or prefer not to answer"
## [5] "Prefer not to answer"
Looking at these values, it is clear that all but gender can be ordered.
NOTE TO TEAM: When we model, we may find that we do not want this in ‘natural’ order but in order of most to least common, or most to least likely to have salary > the state per capita income. TBD
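If we do decide to reorder, a minimal sketch of both options using forcats (loaded with the tidyverse) might look like this; the new column names are hypothetical, and hi_comp is the 0/1 target created later in this report.

#hypothetical sketch - reorder age levels from most to least common
sal_US <- sal_US |>
  mutate(age_by_freq = forcats::fct_infreq(age))
#hypothetical sketch - reorder age levels by the share of responses with hi_comp = 1
sal_US <- sal_US |>
  mutate(age_by_target = forcats::fct_reorder(
    age, as.numeric(as.character(hi_comp)), .fun = mean, .desc = TRUE))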
We will remove any rows not in the US. Part of the messiness of this data is the variety of ways a single country is written: United States, US, USA, etc. We keep it simple and eliminate rows where state is NA. Judging by the city field, the 171 rows where state is NA do appear to be mostly from the US, and while in some cases the state could be derived from the city, in others it is not clear. Since eliminating the rows without a state still leaves us with almost 23K rows, this seems the best solution.
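Had we wanted to recover some of those 171 rows, a hedged sketch of a city-to-state back-fill is below; we did not do this, and the lookup table shown is hypothetical and would only ever cover unambiguous cities.

#hypothetical sketch only - back-fill state for a few unambiguous cities before dropping NA-state rows
city_to_state <- tibble::tribble(
  ~city,      ~state_fill,
  "Boston",   "Massachusetts",
  "Chicago",  "Illinois",
  "Seattle",  "Washington"
)
sal_df <- sal_df |>
  left_join(city_to_state, by = "city") |>
  mutate(state = coalesce(state, state_fill)) |>
  dplyr::select(-state_fill)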
After removing all the non-US rows, we have 22,995 rows and 17 columns. We removed the ctry column, as it no longer added value.
Taking a look at state, there are 106 cases where multiple states are selected. This is small compared to the total number of rows, so we eliminate them and still have almost 23K rows (22,889).
## [1] "Mississippi, Missouri"
## [2] "California, Maryland"
## [3] "California, Illinois, Massachusetts, North Carolina, South Carolina, Virginia"
## [4] "Alabama, California"
## [5] "Michigan, Texas, Washington"
## [6] "Alabama, Oregon"
## [1] 106
Now that the state column is clean, we can merge the two data frames, bringing in the per capita income by state to each row.
## state PIncPerCap
## 1 Alabama 50916
## 2 Alaska 68635
## 3 Arizona 58442
## 4 Arkansas 52618
## 5 California 77036
## 6 Colorado 75722
We want to create the target column indicating if the salary is above or below the Personal Income Per Capita. The steps to do this are:

- sum the annual_sal and add_comp fields to make a total_comp field
- create a column (hi_comp) that compares total_comp and PIncPerCap, setting the value to 1 if total_comp is greater and 0 otherwise
## annual_sal add_comp total_comp PIncPerCap hi_comp
## 1 120000 NA 120000 50916 1
## 2 70000 NA 70000 50916 1
## 3 120000 15000 135000 50916 1
## 4 175000 NA 175000 50916 1
## 5 40000 0 40000 50916 0
## 6 22800 NA 22800 50916 0
Now we have the data we want to model. Let’s skim for a quick view of summary statistics, and a visualization of distributions with mini histograms.
After all our pre-processing, we have 22,889 responses (rows) with 20 variables.
We notice here that there are many missing values, particularly in title_context, other_ccy, income_context and add_comp.
We also see that the numeric data is mostly skewed right, except for PIncPerCap, which is rather normal.
| Name | sal_US |
|---|---|
| Number of rows | 22889 |
| Number of columns | 20 |
| Column type frequency: | |
| character | 10 |
| factor | 6 |
| numeric | 4 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1.00 | 4 | 20 | 0 | 51 | 0 |
| timestamp | 0 | 1.00 | 16 | 19 | 0 | 20899 | 0 |
| industry | 56 | 1.00 | 2 | 171 | 0 | 1022 | 0 |
| job_title | 0 | 1.00 | 1 | 126 | 0 | 11916 | 0 |
| title_context | 16905 | 0.26 | 1 | 808 | 0 | 5771 | 10 |
| ccy | 0 | 1.00 | 3 | 7 | 0 | 8 | 0 |
| other_ccy | 22851 | 0.00 | 1 | 100 | 0 | 24 | 0 |
| income_context | 20408 | 0.11 | 1 | 939 | 0 | 2435 | 2 |
| city | 12 | 1.00 | 1 | 84 | 0 | 3710 | 6 |
| race | 130 | 0.99 | 5 | 125 | 0 | 45 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| age | 0 | 1.00 | TRUE | 7 | 25-: 10180, 35-: 8148, 45-: 2646, 18-: 960 |
| yrs_wrk_exp | 0 | 1.00 | TRUE | 8 | 11 : 7978, 8 -: 4371, 5-7: 3924, 21 : 3001 |
| yrs_in_field_exp | 0 | 1.00 | TRUE | 8 | 11 : 5401, 5-7: 5363, 2 -: 4944, 8 -: 4130 |
| edu | 138 | 0.99 | TRUE | 6 | Col: 11148, Mas: 7363, Som: 1612, Pro: 1119 |
| gender | 130 | 0.99 | FALSE | 5 | Wom: 17755, Man: 4187, Non: 590, Oth: 226 |
| hi_comp | 0 | 1.00 | FALSE | 2 | 1: 14389, 0: 8500 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| annual_sal | 0 | 1.00 | 92025.95 | 75165.15 | 0 | 56160 | 78000 | 112000 | 5000044 | ▇▁▁▁▁ |
| add_comp | 5720 | 0.75 | 12700.58 | 46465.79 | 0 | 0 | 2000 | 10000 | 1500000 | ▇▁▁▁▁ |
| PIncPerCap | 0 | 1.00 | 69626.18 | 10301.27 | 46370 | 61475 | 68840 | 77036 | 95970 | ▁▇▇▅▁ |
| total_comp | 0 | 1.00 | 101552.64 | 93505.36 | 0 | 58000 | 81000 | 120000 | 5004044 | ▇▁▁▁▁ |
Let’s look at how we might address the missing values.
Industry looks like it was a free text field, and there are over 1,000 different values, which is not very helpful on its own. A bit of research suggests grouping industries into five broader sectors: Raw Materials, Manufacturing and Construction, Information Services, Human Services, and Service.
A sample of the industry values:
## [1] "Computing or Tech"
## [2] "Federal Contracting/Business Development"
## [3] "Government and Public Administration"
## [4] "Law"
## [5] "STEM medical research"
## [6] "Education (Higher Education)"
Showing the first few job_title values where industry is NA:
## [1] "Adult Services Librarian"
## [2] "Cybersecurity senior risk specialist "
## [3] "Senior Manager of Ops Strategy"
## [4] "UX Designer"
## [5] "Lab Assistant (Level II)"
## [6] "People Operations Manager "
The title_context field is there to provide further explanation when the job_title requires it. Again, this is a free text field, and though most values are NA, there are still almost 6K responses, likely to be unique.
Our strategy here will be to turn this into a flag (0,1) indicating whether there was something more to explain about the job title (1) or not (0). One thing to consider is that some respondents seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case); a sketch of this follows the sample values below.
## [1] NA
## [2] "Senior associate"
## [3] "Mid- level professional biotechnology laboratory researcher "
## [4] "na"
## [5] "Make graphics and videos, write captions post them on social media"
## [6] "Commercial design (not residential)"
As we eliminated the non-US country rows, we would expect ccy to be all USD, and therefore other_ccy to be all NAs. Instead, ccy is not all USD: 13 rows have ccy = ‘Other’ and another 13 have a specific non-USD currency, for a total of 26 rows with a currency other than USD.
##
## AUD/NZD CAD EUR GBP JPY Other SEK USD
## 1 5 4 1 1 13 1 22863
Looking at what is in other_ccy, we find that there are a few USD or variations (US Dollar, American Dollars), a few that are another currency or equity/stock. But most seem to be answering a different question.
Since the comparison to the state personal income per capita requires that the total_comp be in USD, we will ‘correct’ the 3 types of ‘others’ that are clearly USD, and then eliminate the other rows that are not USD (or not identifiable as such).
## [1] "Na "
## [2] "My bonus is based on performance up to 10% of salary"
## [3] "Equity"
## [4] "N/a"
## [5] "Stock "
## [6] "American Dollars"
Similar to title_context, the purpose of the income_context variable is to provide further explanation when the annual_sal, add_comp or ccy fields require it. Again, this is a free text field, and though most values are NA, there are still almost 2.5K responses, likely to be unique.
As with title_context, our strategy will be to turn this into a flag (0,1) indicating whether there was something more to explain about the income (1) or not (0). Here again, some respondents seem to have written in “na” or something similar - we will want those to become 0 rather than 1 (where it is possible to identify this case).
## [1] NA
## [2] "Need to Pay Fees"
## [3] "na"
## [4] "24.2/hr"
## [5] "Clinical pharmacist in a small hospital"
## [6] "My bonuses are based on my fees in cases and it’s never a consistent amount. Ranges from $10,000-$50,000 per year so far."
For city, we notice that in addition to the 12 actual NAs, there are 32 rows with “na” and another 92 with some variation of “prefer not to answer”.
We are not sure how city might affect the model. Some cities might have higher salaries, but since this again is free text, there are over 2K unique responses. Since there are almost as many unique responses as there are rows, and we cannot think of a way to group them, we will eliminate this column.
## [1] "Birmingham" "Huntsville" "Birmingham " "Auburn" "Athens"
## [6] "na"
There are very similar counts of NAs for race, edu and gender (130, 138 and 130, respectively).
There are about 20 or so responses where all three of these are blank, so it makes sense to remove those rows.
For the rest, we will fold the NAs into a “prefer not to answer” category for each variable (see the Data Preparation section).
Note that there are 6 options shown below, but there is also a 7th: ‘Hispanic, Latino, or Spanish origin’, which only appears along with other options selected.
## [1] "White"
## [2] "Asian or Asian American"
## [3] "Another option not listed here or prefer not to answer"
## [4] "Native American or Alaska Native"
## [5] "Black or African American"
## [6] "Middle Eastern or Northern African"
## [7] NA
We now have a plan for what data preparation is needed for missing values. Let’s explore other aspects of the data.
The histograms in skim are small and some detail is lost, so let’s take a closer look at some of the continuous variables.
On our first look, the distributions are hard to read because of the very long tails.
## Warning: `label_number_si()` was deprecated in scales 1.2.0.
## ℹ Please use the `scale_cut` argument of `label_number()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 5720 rows containing non-finite values (`stat_bin()`).
The 75th percentile for annual_sal is less than $150,000, while the max is over $5,000,000. There is a very small number (fewer than 20) of rows where either annual_sal or add_comp is greater than $1 million. In order to see what the majority (the first 75%) of the data looks like, we will filter annual_sal, add_comp and total_comp to values below $150,000.
When we do this, we can see that the large majority of the data is fairly normal. The right skew in annual_sal, add_comp and total_comp comes from the small number of very high values.
PIncPerCap looks multimodal.
We want to take a look at what contributes to having a total_comp greater than the PIncPerCap, so we look at the distribution of who does and does not exceed it, broken down by the qualitative variables that we have.
These charts show how the percent of respondents with hi_comp=1 (total_comp greater than PIncPerCap) varies across the levels of each of these variables.
NOTE TO TEAM: I wanted to try this, but all it really shows is that if total_comp is higher, one is more likely to have hi_comp=1. Not really a revelation. Maybe we take this out?
Again here, we restrict to total_comp < $150,000 (roughly the lower three quartiles of the data).
To understand how the Personal Income Per Capita varies by state, we can plot it on a map.
Because DC is so small on the map, it is hard to see that it has the highest per capita personal income, $95,970. But you can see that it is followed by Massachusetts (MA) at $84,561 and Connecticut (CT) at $82,938. Of course the per capita calculation is greatly affected by population, especially in the highest-population states (California, Texas, Florida and New York) and the lowest-population states (Wyoming, Vermont, DC, and Alaska).
We are also interested to see how many responses there are in each state. To do this, we create a small table with just the state and the count of rows with that state in the sal_US data.
California (CA), with 2,607 responses, and New York (NY), with 2,168, are the top two. Wyoming (WY) with 24 and South Dakota (SD) with 25 are at the bottom. Every state (plus DC) is represented.
For our third map, we are interested in the mean of the total_comp reported by state. As we did above, we will create a small table to capture the data, then map it.
The top 3 are California (CA) at $135,362.24, Washington (WA) at $127,442.40 and New York (NY) at $117,769.82. At the other end of the spectrum are Wyoming (WY) at $60,475.96, Montana (MT) at $63,200.66 and Mississippi (MS) at $63,929.22.
Now that we have taken a good look at our data, we can begin to prepare it for modeling, both to predict when total_comp is greater than personal income per capita and to predict the actual total_comp.
In order to be able to compare to our (pre-processed but still) baseline data, we will create a new data frame, sal_US1.
Some missing values will actually be addressed by creating new categories, so those will be discussed in the following section.
To make valid comparisons, we want all responses to be reported in USD. As described above, there are a few rows with ccy = ‘Other’ where the value in other_ccy makes it clear it is USD. We will update those and then delete the 23 non-USD rows. Once that is complete, we no longer need ccy or other_ccy, so we will remove those columns.
Recall that city is free text, with almost as many unique values as rows, and no good way to categorize it. We do not need to deal with the NAs, as we will remove this column.
As described above, we will group industry into 5 categories: Raw Materials, Manufacturing and Construction, Information Services, Human Services, and Service.
This will take some iterative effort to identify the group for the many varied values of the free text field industry.
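A small sketch of the check we can repeat between passes is below: it lists the most common industry values that have not yet been assigned a group (indust_grp is the column the case_when creates).

#show the most common industry values still unmatched after a pass
sal_US1 |>
  filter(is.na(indust_grp)) |>
  count(industry, sort = TRUE) |>
  head(20)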
Once complete, we can see that Information Services has the largest number of responses, and very few were from the Raw Materials Industry Group.
##
## Human Services Information Services
## 4166 9083
## Manufacturing and Construction Raw Materials
## 1876 169
## Service
## 7523
There are 58 rows left with NA in industry and therefore in indust_grp. We will remove those rows, and then remove industry, leaving just the indust_grp column.
We now have 22,817 rows and 17 variables.
We will turn both context fields, title_context and income_context, into flags (0,1) indicating whether there is additional context. The values in these fields are too varied to be of any use in categorizing any other way, but it is possible a response where it was necessary to add context might impact total_comp one way or another. We will then remove the original fields, leaving the same number of variables, 17.
As we can see, we have 5,974 rows with title_context_flg=1
##
## 0 1
## 16843 5974
And 2,475 rows with income_context_flg=1
##
## 0 1
## 20342 2475
There are a few rows where race, edu and gender are all NA. These will be removed. This leaves 22,799 rows.
Next, we want to combine “Other or prefer not..” + “Prefer not..” + NAs in gender, leaving no NAs and a more organized set of values.
## [1] "Man" "Woman" "Non-binary"
## [4] "Prefer not to answer"
Similarly, we want to combine “Prefer not..” + NAs in edu, leaving no NAs and a more organized set of values.
## [1] "College degree" "Master's degree"
## [3] "Professional degree (MD, JD, etc.)" "High School"
## [5] "Some college" "PhD"
## [7] "Prefer not to answer"
On race, we combine “Another option… or prefer not..” + NAs. Since this is a pick-all-that-apply field, we then make a set of flags indicating whether each of the 7 choices was or was not selected. Once those fields are created, we remove the original multiselect race column.
Since we were not able to look at these fields in the Target Flag Breakdown section of the Data Exploration, we will do so here.
Much of add_comp is NA, as not everyone gets compensation beyond salary. We have captured the added amount in total_comp (annual_sal+add_comp). But it may be of interest if add_comp is received or not. So, here again, we create a flag to indicate if add_comp was populated (1) or not (0). Then we remove the original field.
As we can see, we have 17,110 rows with add_comp_flg=1
##
## 0 1
## 5689 17110
Timestamp is just the datetime of the response, and job_title is another free text field with values impossible to categorize. We will remove both.
#remove timestamp and job_title
sal_US1 = subset(sal_US1, select = -c(timestamp, job_title))
We now have 22,799 rows with 21 variables which we have either cleaned or created.
To find outliers, we will use the scores() function from the outliers package, with type = “iqr” and lim = 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.
We will create flags for each variable to identify potential outliers.
Using this method, we created:

- annual_salOUT: which shows we have 891 (4%) potential outliers in annual_sal
- total_compOUT: which shows 1,250 (5%) potential outliers in total_comp
The table below shows the resulting values (TRUE is a potential outlier).
## annual_salOUT total_compOUT
## FALSE 21998 21639
## TRUE 891 1250
Since most of the data is categorical, we can use a correlation funnel to look for correlation in our data.
library(tidyverse)
library(skimr) #skim
library(gdata) #trim
library(ggpubr) #ggarrange
library(scales) #formatting numbers in ggplot axis
library(cowplot) #density chart
library(outliers) #scores
library(qdapTools) #mtabulate
library(ggcorrplot) #corrplot for categorical data
library(correlationfunnel)
library(usmap)
#Loading Data
sal_df <- as.data.frame(read.csv('https://raw.githubusercontent.com/dianaplunkett/621/main/Ask%20A%20Manager%20Salary%20Survey%202021%20(Responses).csv'))
#load in the per capita income data.
state_inc<-as.data.frame(read.csv("https://raw.githubusercontent.com/dianaplunkett/621/main/PersonalIncomePerCapita.csv"))
names(sal_df)<-c('timestamp','age','industry','job_title', 'title_context','annual_sal','add_comp','ccy','other_ccy','income_context','ctry','state','city','yrs_wrk_exp','yrs_in_field_exp','edu','gender', 'race')
names(state_inc)<-c('state','PIncPerCap')
### Data Preprocessing
glimpse(sal_df)
#empty strings to NA
sal_df[sal_df == ''] <- NA
#make factors
sal_df |>
mutate(
across(
c(age, yrs_wrk_exp, yrs_in_field_exp, edu, gender),
as_factor
)
) -> sal_df
sal_df |>
dplyr::select(where(is.factor)) |>
sapply(levels)
#order factors
sal_df |>
mutate(
age = ordered(age,
c("under 18",
"18-24",
"25-34",
"35-44",
"45-54",
"55-64",
"65 or over")),
yrs_wrk_exp = ordered(yrs_wrk_exp,
c("1 year or less",
"2 - 4 years",
"5-7 years",
"8 - 10 years",
"11 - 20 years",
"21 - 30 years",
"31 - 40 years",
"41 years or more")),
yrs_in_field_exp = ordered(yrs_in_field_exp,
c("1 year or less",
"2 - 4 years",
"5-7 years",
"8 - 10 years",
"11 - 20 years",
"21 - 30 years",
"31 - 40 years",
"41 years or more")),
edu = ordered(edu,
c("High School",
"Some college",
"College degree",
"Master's degree",
"PhD",
"Professional degree (MD, JD, etc.)")),
) -> sal_df
#create a new df sal_US with only rows where state != NA
sal_US <- sal_df[!is.na(sal_df$state),]
#remove the ctry column, all rows are now US
sal_US = subset(sal_US, select = -c(ctry) )
#unique state shows 132 different values, just showing tail for clarity.
tail(unique(sal_US$state))
#how many rows have a ',', eg multiple states
length(which(grepl(',', sal_US$state)))
#delete the rows with multiple states
sal_US <- sal_US[!grepl(',', sal_US$state),]
#merge rows
sal_US <- merge(sal_US, state_inc, by ='state', all.x=TRUE)
head(sal_US |> distinct(state,PIncPerCap))
#sum the annual_sal and add_comp
sal_US$total_comp <- rowSums(sal_US[, c('annual_sal','add_comp')], na.rm=TRUE)
#create the hi_comp field
sal_US <- sal_US |> mutate(hi_comp=if_else(total_comp>PIncPerCap,1,0))
sal_US$hi_comp <- as_factor(sal_US$hi_comp)
head(sal_US |> select (annual_sal, add_comp,total_comp,PIncPerCap, hi_comp))
#Data Skimming
skim(sal_US)
head(unique(sal_US$industry))
head(sal_US$job_title[is.na(sal_US$industry)])
head(unique(sal_US$title_context))
table(sal_US$ccy)
tail (unique(sal_US$other_ccy))
head(unique(sal_US$income_context))
head(unique(sal_US$city))
#### Demographic variables
unique(sal_US[(!grepl(',', sal_US$race)), c('race')])
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
pa <- sal_US |>
ggplot (aes(x=annual_sal))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Annual Salary") +
scale_x_continuous(labels = scales::label_number_si())+
theme_classic()
pb <- sal_US |>
ggplot (aes(x=add_comp))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Additional Compensation") +
scale_x_continuous(labels = scales::label_number_si())+
theme_classic()
pc <- sal_US |>
ggplot (aes(x=total_comp))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Total Compensation") +
scale_x_continuous(labels = scales::label_number_si()) +
theme_classic()
pd <- sal_US |>
ggplot (aes(x=PIncPerCap))+
geom_histogram(bins=20, fill ="#6aaec8")+
scale_x_continuous(labels = scales::label_number_si())+
labs(x = "Personal Income Per Capita") +
theme_classic()
ggarrange(pa,pb,pc,pd)
#Variable Distribution
#Skim histos are small, so looking at some a bit closer
options(scipen = 999)
p1 <- sal_US |>
filter(annual_sal < 150000) |>
ggplot (aes(x=annual_sal))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Annual Salary") +
scale_x_continuous(labels = scales::label_number_si())+
theme_classic()
p2 <- sal_US |>
filter(add_comp < 150000) |>
ggplot (aes(x=add_comp))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Additional Compensation") +
scale_x_continuous(labels = scales::label_number_si())+
theme_classic()
p3 <- sal_US |>
filter(total_comp < 150000) |>
ggplot (aes(x=total_comp))+
geom_histogram(bins=50, fill ="#6aaec8")+
labs(x = "Total Compensation") +
scale_x_continuous(labels = scales::label_number_si()) +
theme_classic()
p4 <- sal_US |>
ggplot (aes(x=PIncPerCap))+
geom_histogram(bins=20, fill ="#6aaec8")+
scale_x_continuous(labels = scales::label_number_si())+
labs(x = "Personal Income Per Capita") +
theme_classic()
ggarrange(p1,p2,p3,p4)
theColors <- c('#6aaec8','#52985e')
p5<-ggplot(sal_US, aes(fill=hi_comp, x=age), axis.title.y=element_blank()) +
geom_bar(position="fill", stat="count", show.legend = FALSE)+
ggtitle("Age")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p6<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_wrk_exp)) +
geom_bar(position="fill", stat="count", show.legend = FALSE)+
ggtitle("Years Work Exp")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p7<-ggplot(sal_US, aes(fill=hi_comp, x=yrs_in_field_exp)) +
geom_bar(position="fill", stat="count", show.legend = FALSE)+
ggtitle("Years In Field")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p8<-ggplot(sal_US, aes(fill=hi_comp, x=edu)) +
geom_bar(position="fill", stat="count", show.legend = FALSE)+
ggtitle("Education")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 10), axis.title.y=element_blank())
p9<-ggplot(sal_US, aes(fill=hi_comp, x=gender)) +
geom_bar(position="fill", stat="count")+
ggtitle("Gender")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 10), axis.title.y=element_blank(), legend.title=element_text(size=8))
ggarrange(p5,p6,p7,p8,p9, nrow=3, ncol = 2)
# Target Amount Breakdown
# density plots
multdens <- function(df, var, target) {
df |>
dplyr::select(annual_sal, add_comp, total_comp, hi_comp) |>
filter(!is.na(.data[[var]])) |> filter(total_comp < 150000) |>
ggplot(aes(.data[[var]], fill = .data[[target]], color = .data[[target]])) +
geom_density(alpha = 0.4) +
scale_fill_manual(values=theColors)+
scale_color_manual(values=theColors)+
labs(y = NULL) +
theme_minimal(base_size = 6) +
theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}
num_col <- sal_US |>
dplyr::select(annual_sal, add_comp, total_comp) |>
colnames()
dens <- lapply(num_col, multdens, df = sal_US, target = "hi_comp")
plot_grid(plotlist = dens, label_x = NULL, ncol=1)
# map of per capita personal income
map<-plot_usmap(data= state_inc, values ='PIncPerCap', color = '#03688f', labels=TRUE, label_color = '#36454F')+
scale_fill_continuous(low = 'white', high = '#52985e',name = 'Personal Income Per Capita', label = scales::comma) +
theme(legend.position = "right")
# Set label font size
map$layers[[2]]$aes_params$size <- 1.5
print(map)
#first we want to gather the data to map
occ<-as.data.frame(table(sal_US$state))
names(occ)<-c('state','responses')
#map of responses by state
map_2<-plot_usmap(data= occ, values ='responses', color = '#52985e', labels=TRUE, label_color = '#36454F')+
scale_fill_continuous(low = 'white', high = '#03688f',name = 'Responses', label = scales::comma) +
theme(legend.position = 'right')
# Set label font size
map_2$layers[[2]]$aes_params$size <- 1.5
print(map_2)
#first we want to gather the data to map
avg<-sal_US |>
group_by(state) |>
summarise_at(vars(total_comp), list(name = mean))
names(avg)<-c('state','avg_comp')
#map of responses by state
map_3<-plot_usmap(data= avg, values ='avg_comp', color = '#03688f', labels=TRUE, label_color = '#36454F')+
scale_fill_continuous(low = 'white', high = '#52985e',name = 'Mean Total Comp', label = scales::comma) +
theme(legend.position = 'right')
# Set label font size
map_3$layers[[2]]$aes_params$size <- 1.5
print(map_3)
# Data Preparation
# new data frame for data prep
sal_US1 <- sal_US
#correct the others that are really USD
sal_US1 <- sal_US1 |>
mutate (ccy = ifelse(ccy=='Other' & (other_ccy=='USD' | other_ccy == "US Dollar" | other_ccy == "American Dollars"),'USD', ccy))
#Keep only the rows where ccy=USD
sal_US1 <- sal_US1[sal_US1$ccy =='USD',]
table(sal_US1$ccy)
#remove ccy and other_ccy columns as now all USD
sal_US1 = subset(sal_US1, select = -c(ccy, other_ccy))
#remove city column
sal_US1 = subset(sal_US1, select = -c(city))
#make indust_grp field
sal_US1 <- sal_US1 %>% mutate (indust_grp =
case_when (
#Raw Materials
grepl('Forest', industry) ~ 'Raw Materials',
grepl('Mining', industry) ~ 'Raw Materials',
grepl('Oil', industry) ~ 'Raw Materials',
grepl('oil', industry) ~ 'Raw Materials',
grepl('Energy', industry) ~ 'Raw Materials',
grepl('energy', industry) ~ 'Raw Materials',
grepl('Petroleum', industry) ~ 'Raw Materials',
grepl('Agriculture', industry) ~ 'Raw Materials',
grepl('Fisher', industry) ~ 'Raw Materials',
#Manufacturing and Construction
grepl('Construction', industry) ~ 'Manufacturing and Construction',
grepl('Contracting', industry) ~ 'Manufacturing and Construction',
grepl('Manufac', industry) ~ 'Manufacturing and Construction',
grepl('MANUFAC', industry) ~ 'Manufacturing and Construction',
grepl('manufac', industry) ~ 'Manufacturing and Construction',
grepl('Mfg', industry) ~ 'Manufacturing and Construction',
grepl('coat', industry) ~ 'Manufacturing and Construction',
grepl('Wine', industry) ~ 'Manufacturing and Construction',
grepl('Packaged Goods', industry) ~ 'Manufacturing and Construction',
grepl('packaged goods', industry) ~ 'Manufacturing and Construction',
grepl('Goods', industry) ~ 'Manufacturing and Construction',
grepl('Consumer', industry) ~ 'Manufacturing and Construction',
grepl('consumer', industry) ~ 'Manufacturing and Construction',
grepl('Commercial furniture', industry) ~ 'Manufacturing and Construction',
grepl('Industrial', industry) ~ 'Manufacturing and Construction',
grepl('Craft Beer Industry', industry) ~ 'Manufacturing and Construction',
grepl('Beverage', industry) ~ 'Manufacturing and Construction',
grepl('Brew', industry) ~ 'Manufacturing and Construction',
grepl('Chemicals', industry) ~ 'Manufacturing and Construction',
grepl('Automotive', industry) ~ 'Manufacturing and Construction',
#Information Services
grepl('Educ', industry) ~ 'Information Services',
grepl('educ', industry) ~ 'Information Services',
grepl('acad', industry) ~ 'Information Services',
grepl('learn', industry) ~ 'Information Services',
grepl('School', industry) ~ 'Information Services',
grepl('Acad', industry) ~ 'Information Services',
grepl('College', industry) ~ 'Information Services',
grepl('Tech', industry) ~ 'Information Services',
grepl('tech', industry) ~ 'Information Services',
grepl('Anal', industry) ~ 'Information Services',
grepl('development', industry) ~ 'Information Services',
grepl('Title IX', industry) ~ 'Information Services',
grepl('Publication', industry) ~ 'Information Services',
grepl('edit', industry) ~ 'Information Services',
grepl('Publish', industry) ~ 'Information Services',
grepl('publish', industry) ~ 'Information Services',
grepl('Ecology', industry) ~ 'Information Services',
grepl('R&D', industry) ~ 'Information Services',
grepl('Info', industry) ~ 'Information Services',
grepl('Librar', industry) ~ 'Information Services',
grepl('LIbrar', industry) ~ 'Information Services',
grepl('librar', industry) ~ 'Information Services',
grepl('LIBRAR', industry) ~ 'Information Services',
grepl('Archivist', industry) ~ 'Information Services',
grepl('Reporter', industry) ~ 'Information Services',
grepl('Engineer', industry) ~ 'Information Services',
grepl('engineer', industry) ~ 'Information Services',
grepl('Copywriter', industry) ~ 'Information Services',
grepl('Develop', industry) ~ 'Information Services',
grepl('Edit', industry) ~ 'Information Services',
grepl('Design', industry) ~ 'Information Services',
grepl('Teach', industry) ~ 'Information Services',
grepl('Train', industry) ~ 'Information Services',
grepl('Instruct', industry) ~ 'Information Services',
grepl('Research', industry) ~ 'Information Services',
grepl('research', industry) ~ 'Information Services',
grepl('Media', industry) ~ 'Information Services',
grepl('STEM', industry) ~ 'Information Services',
grepl('Think tank', industry) ~ 'Information Services',
grepl('Marketing', industry) ~ 'Information Services',
grepl('marketing', industry) ~ 'Information Services',
grepl('Communica', industry) ~ 'Information Services',
grepl('market research', industry) ~ 'Information Services',
grepl('Data', industry) ~ 'Information Services',
grepl('data', industry) ~ 'Information Services',
grepl('content', industry) ~ 'Information Services',
grepl('Content', industry) ~ 'Information Services',
grepl('CPG', industry) ~ 'Information Services',
grepl('IT', industry) ~ 'Information Services',
grepl('Scien', industry) ~ 'Information Services',
grepl('science', industry) ~ 'Information Services',
grepl('Video', industry) ~ 'Information Services',
grepl('Internet', industry) ~ 'Information Services',
grepl('Archaeology', industry) ~ 'Information Services',
grepl('Bio', industry) ~ 'Information Services',
grepl('chemistry', industry) ~ 'Information Services',
grepl('student', industry) ~ 'Information Services',
grepl('Student', industry) ~ 'Information Services',
grepl('Graduate assistant', industry) ~ 'Information Services',
grepl('It security', industry) ~ 'Information Services',
grepl('lab', industry) ~ 'Information Services',
grepl('Print', industry) ~ 'Information Services',
grepl('Chemistry', industry) ~ 'Information Services',
grepl('Gaming', industry) ~ 'Information Services',
grepl('Software', industry) ~ 'Information Services',
grepl('Journalism', industry) ~ 'Information Services',
grepl('journalism', industry) ~ 'Information Services',
grepl('Work-Study', industry) ~ 'Information Services',
grepl('Geospatial', industry) ~ 'Information Services',
grepl('Virtual reality', industry) ~ 'Information Services',
grepl('Survey', industry) ~ 'Information Services',
grepl('PR', industry) ~ 'Information Services',
grepl('Quality Assurance', industry) ~ 'Information Services',
#Human Services
grepl('Gov', industry) ~ 'Human Services',
grepl('gov', industry) ~ 'Human Services',
grepl('Volun', industry) ~ 'Human Services',
grepl('Policy', industry) ~ 'Human Services',
grepl('City', industry) ~ 'Human Services',
grepl('Federal', industry) ~ 'Human Services',
grepl('federal', industry) ~ 'Human Services',
grepl('State', industry) ~ 'Human Services',
grepl('Grant', industry) ~ 'Human Services',
grepl('Nonprofit', industry) ~ 'Human Services',
grepl('Non-profit', industry) ~ 'Human Services',
grepl('Workforce', industry) ~ 'Human Services',
grepl('Social Work', industry) ~ 'Human Services',
grepl('Fundrais', industry) ~ 'Human Services',
grepl('Environ', industry) ~ 'Human Services',
grepl('environ', industry) ~ 'Human Services',
grepl('Natural', industry) ~ 'Human Services',
grepl('not for profit', industry) ~ 'Human Services',
grepl('Non Profit', industry) ~ 'Human Services',
grepl('Non profit', industry) ~ 'Human Services',
grepl('Military', industry) ~ 'Human Services',
grepl('Ministry', industry) ~ 'Human Services',
grepl('Clergy', industry) ~ 'Human Services',
grepl('Chaplain', industry) ~ 'Human Services',
grepl('Foundation', industry) ~ 'Human Services',
grepl('Religio', industry) ~ 'Human Services',
grepl('Heritage', industry) ~ 'Human Services',
grepl('AmeriCorps', industry) ~ 'Human Services',
grepl('Intelligence', industry) ~ 'Human Services',
grepl('Museum', industry) ~ 'Human Services',
grepl('Zoo', industry) ~ 'Human Services',
grepl('Philanthropy', industry) ~ 'Human Services',
grepl('International development', industry) ~ 'Human Services',
grepl('Historic Preservation', industry) ~ 'Human Services',
grepl('Maritime', industry) ~ 'Human Services',
grepl('Church', industry) ~ 'Human Services',
grepl('Association', industry) ~ 'Human Services',
grepl('association', industry) ~ 'Human Services',
grepl('Politic', industry) ~ 'Human Services',
grepl('politic', industry) ~ 'Human Services',
grepl('Defense', industry) ~ 'Human Services',
grepl('union', industry) ~ 'Human Services',
grepl('Union', industry) ~ 'Human Services',
grepl('Labor', industry) ~ 'Human Services',
grepl('Funding', industry) ~ 'Human Services',
grepl('Performing Arts', industry) ~ 'Human Services',
grepl('Hosp', industry) ~ 'Human Services',
grepl('parent', industry) ~ 'Human Services',
#Service
grepl('Law', industry) ~ 'Service',
grepl('Legal', industry) ~ 'Service',
grepl('attorney', industry) ~ 'Service',
grepl('Psychol', industry) ~ 'Service',
grepl('Counseling', industry) ~ 'Service',
grepl('Doc', industry) ~ 'Service',
grepl('Forensics', industry) ~ 'Service',
grepl('Toxicology', industry) ~ 'Service',
grepl('Medical', industry) ~ 'Service',
grepl('Manage', industry) ~ 'Service',
grepl('manage', industry) ~ 'Service',
grepl('Admin', industry) ~ 'Service',
grepl('Reception', industry) ~ 'Service',
grepl('design', industry) ~ 'Service',
grepl('Sale', industry) ~ 'Service',
grepl('sale', industry) ~ 'Service',
grepl('Buyer', industry) ~ 'Service',
grepl('Bank', industry) ~ 'Service',
grepl('finance', industry) ~ 'Service',
grepl('trading', industry) ~ 'Service',
grepl('Archi', industry) ~ 'Service',
grepl('archi', industry) ~ 'Service',
grepl('product development', industry) ~ 'Service',
grepl('Equipment', industry) ~ 'Service',
grepl('Health', industry) ~ 'Service',
grepl('health', industry) ~ 'Service',
grepl('Retail', industry) ~ 'Service',
grepl('Tour', industry) ~ 'Service',
grepl('Screening', industry) ~ 'Service',
grepl('Insur', industry) ~ 'Service',
grepl('Hospitality', industry) ~ 'Service',
grepl('Consult', industry) ~ 'Service',
grepl('consult', industry) ~ 'Service',
grepl('Rideshare', industry) ~ 'Service',
grepl('Transp', industry) ~ 'Service',
grepl('Logist', industry) ~ 'Service',
grepl('Distribution', industry) ~ 'Service',
grepl('Deliver', industry) ~ 'Service',
grepl('deliver', industry) ~ 'Service',
grepl('sustainability', industry) ~ 'Service',
grepl('Pharma', industry) ~ 'Service',
grepl('Planning', industry) ~ 'Service',
grepl('Recruit', industry) ~ 'Service',
grepl('HR', industry) ~ 'Service',
grepl('Equity', industry) ~ 'Service',
grepl('Human', industry) ~ 'Service',
grepl('Childcare', industry) ~ 'Service',
grepl('Child care', industry) ~ 'Service',
grepl('Child Care', industry) ~ 'Service',
grepl('Telecom', industry) ~ 'Service',
grepl('parking', industry) ~ 'Service',
grepl('restoration', industry) ~ 'Service',
grepl('Homemaker', industry) ~ 'Service',
grepl('Caregiver', industry) ~ 'Service',
grepl('Landscap', industry) ~ 'Service',
grepl('landscap', industry) ~ 'Service',
grepl('Repair', industry) ~ 'Service',
grepl('repair', industry) ~ 'Service',
grepl('Facilities', industry) ~ 'Service',
grepl('engraving', industry) ~ 'Service',
grepl('Call center', industry) ~ 'Service',
grepl('Wholesale', industry) ~ 'Service',
grepl('wholesale', industry) ~ 'Service',
grepl('CALL', industry) ~ 'Service',
grepl('Pharma', industry) ~ 'Service',
grepl('pharma', industry) ~ 'Service',
grepl('Regulator', industry) ~ 'Service',
grepl('Entertain', industry) ~ 'Service',
grepl('Sport', industry) ~ 'Service',
grepl('Music', industry) ~ 'Service',
grepl('music', industry) ~ 'Service',
grepl('I work for Indeed.com', industry) ~ 'Service',
grepl('Interpret', industry) ~ 'Service',
grepl('Funeral', industry) ~ 'Service',
grepl('funeral', industry) ~ 'Service',
grepl('Translation', industry) ~ 'Service',
grepl('Concrete', industry) ~ 'Service',
grepl('ecomm', industry) ~ 'Service',
grepl('E-comm', industry) ~ 'Service',
grepl('E-Comm', industry) ~ 'Service',
grepl('e-comm', industry) ~ 'Service',
grepl('E comm', industry) ~ 'Service',
grepl('EComm', industry) ~ 'Service',
grepl('Ecomm', industry) ~ 'Service',
grepl('Pest', industry) ~ 'Service',
grepl('Service', industry) ~ 'Service',
grepl('service', industry) ~ 'Service',
grepl('Cleaning', industry) ~ 'Service',
grepl('Saas', industry) ~ 'Service',
grepl('compli', industry) ~ 'Service',
grepl('Assisting', industry) ~ 'Service',
grepl('Apparel', industry) ~ 'Service',
grepl('appraisal', industry) ~ 'Service',
grepl('Compli', industry) ~ 'Service',
grepl('Business', industry) ~ 'Service',
grepl('Strategy', industry) ~ 'Service',
grepl('Eap', industry) ~ 'Service',
grepl('Investing', industry) ~ 'Service',
grepl('Storage', industry) ~ 'Service',
grepl('Supply', industry) ~ 'Service',
grepl('Procur', industry) ~ 'Service',
grepl('Airline', industry) ~ 'Service',
grepl('Aerospace', industry) ~ 'Service',
grepl('aviat', industry) ~ 'Service',
grepl('Aviat', industry) ~ 'Service',
grepl('Importing', industry) ~ 'Service',
grepl('Staffing', industry) ~ 'Service',
grepl('Cannabis', industry) ~ 'Service',
grepl('warehous', industry) ~ 'Service',
grepl('Warehous', industry) ~ 'Service',
grepl('Rental', industry) ~ 'Service',
grepl('Food', industry) ~ 'Service',
grepl('food', industry) ~ 'Service',
grepl('cook', industry) ~ 'Service',
grepl('Beauty', industry) ~ 'Service',
grepl('safety', industry) ~ 'Service',
grepl('protect', industry) ~ 'Service',
grepl('Operations', industry) ~ 'Service',
grepl('Real Estate', industry) ~ 'Service',
grepl('Real estate', industry) ~ 'Service',
grepl('real estate', industry) ~ 'Service',
grepl('Fitness', industry) ~ 'Service',
grepl('Gym', industry) ~ 'Service',
grepl('accessibility', industry) ~ 'Service',
grepl('Accessibility', industry) ~ 'Service',
grepl('Executive', industry) ~ 'Service',
grepl('fitness', industry) ~ 'Service',
grepl('Security', industry) ~ 'Service',
grepl('Per Sit', industry) ~ 'Service',
grepl('Pet', industry) ~ 'Service',
grepl('Vet', industry) ~ 'Service',
grepl('Animal', industry) ~ 'Service',
grepl('restaurant', industry) ~ 'Service',
grepl('Restaurant', industry) ~ 'Service',
grepl('Auction', industry) ~ 'Service',
grepl('Mortgage', industry) ~ 'Service',
grepl('mortgage', industry) ~ 'Service',
grepl('Tax', industry) ~ 'Service',
grepl('Film', industry) ~ 'Service',
grepl('Travel', industry) ~ 'Service',
grepl('Gambl', industry) ~ 'Service',
grepl('office', industry) ~ 'Service',
grepl('Office', industry) ~ 'Service',
grepl('recycling', industry) ~ 'Service',
grepl('Fashion', industry) ~ 'Service',
grepl('Finance', industry) ~ 'Service',
grepl('coach', industry) ~ 'Service',
grepl('Corporate', industry) ~ 'Service',
grepl('camp', industry) ~ 'Service',
grepl('Cosmetology', industry) ~ 'Service',
grepl('Purchasing', industry) ~ 'Service',
grepl('Printing', industry) ~ 'Service',
grepl('Global Mobility', industry) ~ 'Service',
grepl('professional', industry) ~ 'Service',
grepl('Horticulture', industry) ~ 'Service',
grepl('Maintenance', industry) ~ 'Service',
grepl('Janitorial', industry) ~ 'Service'
))
table(sal_US1$indust_grp)
sal_US1 <- sal_US1[!is.na(sal_US1$indust_grp),]
sal_US1 = subset(sal_US1, select = -c(industry))
#create the title_context_flg field
sal_US1 <- sal_US1 |>
mutate(title_context_flg=if_else(!is.na(title_context),1,0))
sal_US1$title_context_flg <- as_factor(sal_US1$title_context_flg)
#create the income_context_flg field
sal_US1 <- sal_US1 |>
mutate(income_context_flg=if_else(!is.na(income_context),1,0))
sal_US1$income_context_flg <- as_factor(sal_US1$income_context_flg)
sal_US1 = subset(sal_US1, select = -c(income_context, title_context))
table(sal_US1$title_context_flg)
table(sal_US1$income_context_flg)
sal_US1<-sal_US1[rowSums(is.na(sal_US1[, c("race", "edu", "gender")])) < 3,]
sal_US1 <- sal_US1 |>
mutate(gender=if_else(grepl('refer', gender) | is.na(gender),"Prefer not to answer",gender))
unique(sal_US1$gender)
sal_US1 <- sal_US1 |>
mutate(edu=if_else(is.na(edu),"Prefer not to answer",edu))
unique(sal_US1$edu)
sal_US1 <- sal_US1 |>
mutate(race=if_else(is.na(race),"Another or prefer not to answer",race))
sal_US1 <- sal_US1 |>
mutate(White=if_else(grepl('White', race),1,0)) |>
mutate(AsianOrAsianAm=if_else(grepl('Asian', race),1,0)) |>
mutate(NatAmOrAlaskaNat=if_else(grepl('Native', race),1,0)) |>
mutate(BlackOrAfricanAm=if_else(grepl('Black', race),1,0)) |>
mutate(MiddleEorNAfrican=if_else(grepl('Middle', race),1,0))|>
mutate(PreferNot=if_else(grepl('prefer', race),1,0)) |>
mutate(HispLatSp=if_else(grepl('Hisp', race),1,0))
sal_US1 = subset(sal_US1, select = -c(race))
p10<-ggplot(sal_US1, aes(fill=hi_comp, x=White)) +
geom_bar(position="fill", stat="count")+
ggtitle("White")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p11<-ggplot(sal_US1, aes(fill=hi_comp, x=AsianOrAsianAm)) +
geom_bar(position="fill", stat="count")+
ggtitle("Asian or Asian American")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p12<-ggplot(sal_US1, aes(fill=hi_comp, x=NatAmOrAlaskaNat)) +
geom_bar(position="fill", stat="count")+
ggtitle("Native American or Alaska Native")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p13<-ggplot(sal_US1, aes(fill=hi_comp, x=BlackOrAfricanAm)) +
geom_bar(position="fill", stat="count")+
ggtitle("Black or African American")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p14<-ggplot(sal_US1, aes(fill=hi_comp, x=MiddleEorNAfrican)) +
geom_bar(position="fill", stat="count")+
ggtitle("Middle Eastern or Northern African")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p15<-ggplot(sal_US1, aes(fill=hi_comp, x=HispLatSp)) +
geom_bar(position="fill", stat="count")+
ggtitle("Hispanic, Latino, or Spanish origin")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
p16<-ggplot(sal_US1, aes(fill=hi_comp, x=PreferNot)) +
geom_bar(position="fill", stat="count")+
ggtitle("Another or prefer not to answer")+ coord_flip()+
scale_fill_manual(values=theColors)+
theme(plot.title = element_text(size = 8), axis.title.y=element_blank(), legend.title=element_text(size=8))
ggarrange(p10,p11,p12,p13,p14,p15,p16, nrow=3, ncol = 3)
#create the add_comp_flg field
sal_US1 <- sal_US1 |>
mutate(add_comp_flg=if_else(!is.na(add_comp),1,0))
sal_US1$add_comp_flg <- as_factor(sal_US1$add_comp_flg)
#remove add_comp
sal_US1 = subset(sal_US1, select = -c(add_comp))
table(sal_US1$add_comp_flg)
#remove timestamp and job_title
sal_US1 = subset(sal_US1, select = -c(timestamp, job_title))
#Outliers
#create new variable to flag potential outliers
sal_US$annual_salOUT<-scores(sal_US$annual_sal, type='iqr', lim=1.5)
sal_US$total_compOUT<-scores(sal_US$total_comp, type='iqr', lim=1.5)
#using mtabulate() from the qdapTools package to show a table with the values of all the OUT variables
t(mtabulate(sal_US[grep('OUT',names(sal_US))]))
# Correlation Funnel
sal_US1 %>%
na.omit() |>
binarize(n_bins = 4, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) |>
correlate(hi_comp__1) |>
plot_correlation_funnel(alpha = 0.7, interactive = TRUE)