Purpose
This analysis aims to investigate factors associated with salary variability.
It is important to note that the data was collected through an independent online survey. While I had no influence on the study design, question selection, or participant selection process, I initially dedicated time to assessing the survey design and identifying specific questions and hypotheses to explore.
Specific Questions
(Please note: this version of the report has been condensed for brevity, as its purpose is to act as a sample of my analysis work. It only features analysis of the relationship between salary and region. This reduction aims to enhance readability while still demonstrating my practical abilities.)
The dataset is sourced from a public online survey conducted at
www.askamanager.org.
The survey creators were interested in salary variability and trends across different dimensions. Participants self-selected to fill out the survey, and their responses are entirely anonymous.
Beginning in 2021, the dataset became “live” and, to this day, it is continually growing as new people respond to the survey. The analysis for this project applies to data sourced on 17.02.2024.
The raw dataset is available here.
This analysis was conducted using the statistical computing environment, R.
Initially, the raw data frame was imported into R as a CSV file and
saved as an object called raw_data.
The analysis proceeded through the following stages:
Reporting
In addition to presenting the analysis results, this report includes a comprehensive account of all R code used for reproducibility and clarity.
For a concise and visual summary of the most notable results only, please refer to this graphical dashboard
Import CSV file as raw_data, and define a new version,
salary_data, for analysis.
raw_data <- read.csv("/Users/jameslauder/Desktop/Data Learning/Data Sets/Ask A Manager Salary Survey 2021 (Personal Copy - est. 106.02.24)) - Form Responses 1.csv")
salary_data <- raw_data
Rename variables to more compact names.
column_names = c("timestamp", "age_group", "industry", "job_title", "job_title_context", "salary", "compensation", "currency", "other_currency", "income_context", "country", "us_state", "city", "work_exp_overall", "work_exp_field", "education", "gender", "race")
colnames(salary_data) = column_names
Append a unique ID variable to the data frame.
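The code that appends the id column is not shown in the original; a minimal sketch, assuming a simple row index is sufficient and dplyr is loaded:
#append a sequential id column and move it to the front of the data frame (sketch only)
salary_data <- salary_data %>%
  mutate(id = row_number()) %>%
  relocate(id)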
Check initial structure of data frame.
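The structure check itself is not shown; it was presumably produced with something along these lines (assuming dplyr/tibble's glimpse()):
glimpse(salary_data)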
## Rows: 28,014
## Columns: 19
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ timestamp <chr> "4/27/2021 11:02:10", "4/27/2021 11:02:22", "4/27/20…
## $ age_group <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34"…
## $ industry <chr> "Education (Higher Education)", "Computing or Tech",…
## $ job_title <chr> "Research and Instruction Librarian", "Change & Inte…
## $ job_title_context <chr> "", "", "", "", "", "", "", "High school, FT", "Data…
## $ salary <chr> "55,000", "54,600", "34,000", "62,000", "60,000", "6…
## $ compensation <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, …
## $ currency <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "US…
## $ other_currency <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ income_context <chr> "", "", "", "", "", "", "", "", "", "I work for a Ch…
## $ country <chr> "United States", "United Kingdom", "US", "USA", "US"…
## $ us_state <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "Sout…
## $ city <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "…
## $ work_exp_overall <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 …
## $ work_exp_field <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years"…
## $ education <chr> "Master's degree", "College degree", "College degree…
## $ gender <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "M…
## $ race <chr> "White", "White", "White", "White", "White", "White"…
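The full-duplicate check that produces the count below is not shown in the original; a minimal base-R sketch:
#count rows that are exact duplicates of an earlier row
sum(duplicated(salary_data))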
## [1] 0
No full duplicates are present, most likely because every entry has a distinct timestamp.
Next, check for duplicates after disregarding id, timestamp, job_title_context, and income_context (such rows are still essentially true duplicates).
variable_vec <- c("age_group", "industry", "job_title", "compensation", "currency", "other_currency", "country", "us_state", "city", "work_exp_overall", "work_exp_field", "education", "gender", "race", "salary")
data_dup_logical <- duplicated(salary_data[variable_vec]) | duplicated(salary_data[variable_vec], fromLast = TRUE)
paste(sum(data_dup_logical), "partial duplicates?")
## [1] "292 partial duplicates?"
Observe the partial duplicates as a new data frame,
data_dup (large output - excluded from report).
Sort duplicates dataframe by a series of variables, for easier comparison. Truncate long values. Observe first 20 rows only, for convenience.
#sort dataframe by a series of variables
data_dup_sorted <- data_dup %>%
arrange(age_group, industry, job_title, compensation, currency, other_currency, country,
us_state, city, work_exp_overall, work_exp_field, education, gender, race, salary)
#truncate long values to make easier to read data frame
max_length = 10
data_dup_sorted_tr <- data_dup_sorted %>%
mutate_at(c("age_group", "industry", "job_title", "job_title_context","compensation", "currency",
"other_currency", "income_context", "country", "us_state", "city", "work_exp_overall",
"work_exp_field", "education", "gender", "race", "salary"), ~substr(., 1, max_length))
data_dup_sorted_tr[1:20,]
## id timestamp age_group industry job_title job_title_context
## 1 16604 4/28/2021 18:15:06 18-24 Business o Certified
## 2 16615 4/28/2021 18:16:06 18-24 Business o Certified
## 3 19523 4/29/2021 10:06:16 18-24 Computing Software e Rotational
## 4 19526 4/29/2021 10:07:16 18-24 Computing Software e Rotational
## 5 14476 4/28/2021 14:46:47 18-24 Engineerin Structural
## 6 14498 4/28/2021 14:50:20 18-24 Engineerin Structural
## 7 19280 4/29/2021 8:42:29 18-24 Law Paralegal
## 8 19288 4/29/2021 8:44:21 18-24 Law Paralegal
## 9 416 4/27/2021 11:09:38 18-24 Law Enforc Library As
## 10 1770 4/27/2021 11:31:35 18-24 Law Enforc Library As
## 11 25341 5/6/2021 13:02:25 18-24 Nonprofits Program Co
## 12 25342 5/6/2021 13:02:38 18-24 Nonprofits Program Co
## 13 14748 4/28/2021 15:23:15 18-24 Nonprofits Senior Res
## 14 14755 4/28/2021 15:23:58 18-24 Nonprofits Senior Res
## 15 18243 4/28/2021 23:18:11 18-24 Retail Area sales Children’s
## 16 18255 4/28/2021 23:22:53 18-24 Retail Area sales Department
## 17 15541 4/28/2021 17:10:31 18-24 Utilities Policy Ana
## 18 15564 4/28/2021 17:11:12 18-24 Utilities Policy Ana
## 19 26777 9/19/2021 0:37:02 25-34 Accounting Chief of S
## 20 26778 9/19/2021 0:38:32 25-34 Accounting Chief of S
## salary compensation currency other_currency income_context country
## 1 63,000 45000 USD USA
## 2 63,000 45000 USD USA
## 3 118,000 <NA> USD United Sta
## 4 118,000 <NA> USD United Sta
## 5 75,600 0 USD USA
## 6 75,600 0 USD USA
## 7 25,000 <NA> GBP United Kin
## 8 25,000 <NA> GBP United Kin
## 9 49,000 1000 USD United Sta
## 10 49,000 1000 USD United Sta
## 11 46425 <NA> USD USA
## 12 46425 <NA> USD USA
## 13 49,500 <NA> USD United Sta
## 14 49,500 <NA> USD United Sta
## 15 49,000 <NA> USD Work typic United Sta
## 16 49,000 <NA> USD Typically United Sta
## 17 57,000 1100 USD US
## 18 57,000 1100 USD US
## 19 70000 <NA> USD United Sta
## 20 70000 <NA> USD United Sta
## us_state city work_exp_overall work_exp_field education gender
## 1 Minnesota Minneapoli 5-7 years 2 - 4 year Master's d Man
## 2 Minnesota Minneapoli 5-7 years 2 - 4 year Master's d Man
## 3 New York New York 2 - 4 year 1 year or College de Woman
## 4 New York New York 2 - 4 year 1 year or College de Woman
## 5 New York New York c 2 - 4 year 2 - 4 year Master's d Woman
## 6 New York New York c 2 - 4 year 2 - 4 year Master's d Woman
## 7 Bristol 5-7 years 2 - 4 year High Schoo Woman
## 8 Bristol 5-7 years 2 - 4 year High Schoo Woman
## 9 Massachuse Boston 2 - 4 year 2 - 4 year College de Woman
## 10 Massachuse Boston 2 - 4 year 2 - 4 year College de Woman
## 11 New York Albany 2 - 4 year 1 year or College de Woman
## 12 New York Albany 2 - 4 year 1 year or College de Woman
## 13 Maryland Bethesda 5-7 years 2 - 4 year College de Woman
## 14 Maryland Bethesda 5-7 years 2 - 4 year College de Woman
## 15 Texas Austin 1 year or 1 year or College de Woman
## 16 Texas Austin 1 year or 1 year or College de Woman
## 17 Virginia Charlottes 2 - 4 year 2 - 4 year College de Woman
## 18 Virginia Charlottes 2 - 4 year 2 - 4 year College de Woman
## 19 Kentucky Louisville 8 - 10 yea 5-7 years Master's d Woman
## 20 Kentucky Louisville 8 - 10 yea 5-7 years Master's d Woman
## race
## 1 Asian or A
## 2 Asian or A
## 3 White
## 4 White
## 5 White
## 6 White
## 7 White
## 8 White
## 9 White
## 10 White
## 11 Asian or A
## 12 Asian or A
## 13 Black or A
## 14 Black or A
## 15 White
## 16 White
## 17 Hispanic,
## 18 Hispanic,
## 19 White
## 20 White
We notice an interesting trend: most, if not all, of the duplicate pairs were submitted within a short time of each other (mostly within a matter of minutes or hours), indicating that they are most likely genuine duplicate submissions from the same person.
I reviewed every one of the duplicates and concluded that they all require removal (153 removals in total, 0.55% of the data).
We call our new data frame: salary_data2
salary_data2 <- salary_data %>%
distinct(age_group, industry, job_title, compensation, currency, other_currency,
country, us_state, city, work_exp_overall, work_exp_field, education,
race, gender, salary, .keep_all = TRUE)
Check number of NA values present for each variable.
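The counting code is not shown in the original; one base-R way to produce a per-variable NA count like the output below:
#count NA values in each column
colSums(is.na(salary_data2))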
## id timestamp age_group industry
## 0 0 0 0
## job_title job_title_context salary compensation
## 0 1 0 7235
## currency other_currency income_context country
## 0 3 2 0
## us_state city work_exp_overall work_exp_field
## 0 16 0 0
## education gender race
## 0 0 0
Check number of blank (whitespace) values present for each variable.
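Likewise, the blank-value count could be produced with something like:
#count values consisting only of white space (including empty strings) in each column
sapply(salary_data2, function(x) sum(grepl("^\\s*$", x)))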
## id timestamp age_group industry
## 0 0 0 73
## job_title job_title_context salary compensation
## 0 20661 0 0
## currency other_currency income_context country
## 0 27652 24831 0
## us_state city work_exp_overall work_exp_field
## 4985 8 0 0
## education gender race
## 214 168 171
timestamp Variable
Format values so that they are recognised as class 'POSIXct'.
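The conversion itself is not shown; a sketch, assuming the m/d/Y H:M:S pattern seen in the structure output above:
#parse the character timestamps into POSIXct date-times (sketch only)
salary_data2$timestamp <- as.POSIXct(salary_data2$timestamp, format = "%m/%d/%Y %H:%M:%S")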
Create a new vector, date, with only the date (no time),
and bind to beginning of salary_data2.
date <- as.Date(salary_data2$timestamp, format = "%m/%d/%Y")
salary_data2 <- salary_data2 %>%
mutate(date = date) %>%
relocate(date, .before = timestamp)
age_group Variable
Convert to factor then check for missing values (NAs or white space).
#convert to ordered factor then view contingency table
salary_data2$age_group <- factor(salary_data2$age_group, ordered = TRUE, levels = c("under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"))
table(salary_data2$age_group)
##
## under 18 18-24 25-34 35-44 45-54 55-64 65 or over
## 13 1197 12558 9845 3170 984 94
#confirm there are no NAs or blanks
nas <- sum(is.na(salary_data2$age_group))
blanks <- sum(grepl("^\\s*$" ,salary_data2$age_group))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 0
industry Variable
Convert to factor then check for missing values (NAs or white space).
#check whether there are NAs or blanks
salary_data2$industry <- as.factor(salary_data2$industry)
nas <- sum(is.na(salary_data2$industry))
blanks <- sum(grepl("^\\s*$" ,salary_data2$industry))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 73
73 blank values present. Convert these to ‘Unspecified’ values.
salary_data2$industry <- gsub("^\\s*$", "Unspecified", salary_data2$industry, ignore.case = TRUE)
#convert back to factor
salary_data2$industry <- as.factor(salary_data2$industry)
Observe the number of levels for industry.
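The level count below can be obtained with, for example:
nlevels(salary_data2$industry)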
## [1] 1218
Collapsing of values is required to drastically reduce the number of levels. Keep the 40 most frequent responses and collapse the rest into an 'Other' category.
Observe the existing factor levels arranged by frequency (display top 40).
#set number of levels to retain
industry_levels = 40
#collapse the remainder into singular level called 'other'
collapsed_factor <- fct_lump_n(salary_data2$industry, industry_levels)
collapsed_factor_df <- data.frame(collapsed_factor)
collapsed_factor_df %>% count(collapsed_factor) %>% arrange(desc(n))
## collapsed_factor n
## 1 Computing or Tech 4662
## 2 Education (Higher Education) 2454
## 3 Nonprofits 2401
## 4 Health care 1881
## 5 Government and Public Administration 1879
## 6 Accounting, Banking & Finance 1788
## 7 Other 1741
## 8 Engineering or Manufacturing 1686
## 9 Marketing, Advertising & PR 1118
## 10 Law 1093
## 11 Business or Consulting 842
## 12 Education (Primary/Secondary) 830
## 13 Media & Digital 769
## 14 Insurance 526
## 15 Retail 500
## 16 Recruitment or HR 459
## 17 Property or Construction 384
## 18 Art & Design 353
## 19 Utilities & Telecommunications 352
## 20 Transport or Logistics 301
## 21 Sales 285
## 22 Social Work 272
## 23 Hospitality & Events 260
## 24 Entertainment 252
## 25 Agriculture or Forestry 138
## 26 Leisure, Sport & Tourism 97
## 27 Unspecified 73
## 28 Publishing 56
## 29 Library 51
## 30 Libraries 50
## 31 Biotech 49
## 32 Law Enforcement & Security 42
## 33 Public Library 35
## 34 Research 33
## 35 Manufacturing 26
## 36 Pharmaceuticals 24
## 37 Architecture 21
## 38 Real Estate 21
## 39 Pharmaceutical 20
## 40 Public Libraries 20
## 41 Pharma 17
The following new levels were established, based on an iterative process of combining equivalent levels (essentially the same responses but spelled or formatted differently) and then reassessing the contingency table of levels (making targeted searches for key words using the code below).
#Archived example code used to find string matches and determine their relative frequency
#matches = grepl("manufac", temp$industry, ignore.case = TRUE)
#temp$industry[matches]
#code used to assign new factor levels
salary_data2$industry <- gsub(".*Librar.*", "Library", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Biotech.*", "Pharmaceutical and Biotechnology", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Pharm.*", "Pharmaceutical and Biotechnology", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Research.*", "Research", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Manufac.*", "Engineering or Manufacturing", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Sci.*", "Sciences", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Estate.*", "Property or Construction", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Vet.*", "Veterinary", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Politic.*", "Politics", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Enviro.*", "Environmental Related", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Oil.*", "Oil and Gas", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Energy.*", "Energy", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Auto.*", "Automotive", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Architect.*", "Architecture", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Food.*", "Food and Beverage", salary_data2$industry, ignore.case = TRUE)
In the end, we retain 40 distinct levels and collapse the rest into an 'Other' category, resulting in 41 levels in total.
country Variable
Convert to factor then check for missing values (NAs or white space).
#confirm there are no NAs or blanks
salary_data2$country <- as.factor(salary_data2$country)
nas <- sum(is.na(salary_data2$country))
blanks <- sum(grepl("^\\s*$" ,salary_data2$country))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 0
Observe the number of levels for country.
## [1] 373
A lot of formatting required, due to inconsistent entry of country names. Get a sense for data by observing the existing factor levels arranged by frequency (display top 40).
#set number of levels to retain before collapsing the remainder into singular level called 'other'
country_levels = 40
collapsed_factor <- fct_lump_n(salary_data2$country, country_levels)
collapsed_factor_df <- data.frame(collapsed_factor)
collapsed_factor_df %>% count(collapsed_factor) %>% arrange(desc(n))
## collapsed_factor n
## 1 United States 8909
## 2 USA 7906
## 3 US 2591
## 4 Canada 1560
## 5 Other 907
## 6 United States 660
## 7 U.S. 577
## 8 UK 574
## 9 United Kingdom 542
## 10 USA 463
## 11 Usa 445
## 12 United States of America 425
## 13 Australia 316
## 14 United states 206
## 15 usa 181
## 16 Germany 172
## 17 England 133
## 18 united states 116
## 19 Us 104
## 20 Ireland 102
## 21 New Zealand 93
## 22 Uk 85
## 23 Canada 75
## 24 Australia 68
## 25 United Kingdom 65
## 26 France 63
## 27 U.S.A. 46
## 28 United States of America 44
## 29 Netherlands 43
## 30 Spain 43
## 31 Scotland 40
## 32 us 37
## 33 Sweden 34
## 34 Belgium 33
## 35 England 32
## 36 Switzerland 29
## 37 canada 27
## 38 Japan 26
## 39 The Netherlands 25
## 40 UK 22
## 41 America 21
## 42 U.S 21
The following new levels were established, based on an iterative process of combining equivalent levels with different formats and then reassessing the contingency table of levels (making targeted searches for key words using the code below).
Note that many values still remain unformatted, as it would simply take too long to catch every single one.
We leave their values as is, for now, rather than lumping them into an 'Other' category.
#Example code used to find string matches and determine their relative frequency
#matches = grepl("unite", salary_data2$country, ignore.case = TRUE)
#salary_data2$country[matches]
#Code used to assign new factor levels
salary_data2$country <- gsub(".*UAE.*|.*Emirates.*", "UAE", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*UK.*|.*U.K.*|.*United K.*|.*Great Bri.*|.*England.*|.*Ireland.*|.*Scotland.*|.*Wales.*", "UK", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*US.*|.*U.S.*|.*U. S.*|.*Unite.*", "USA", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Canada.*", "Canada", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Netherlands.*", "Netherlands", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*New Ze.*|.*NZ.*", "New Zealand", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*German.*", "Germany", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Sweden.*", "Sweden", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Switzer.*", "Switzerland", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Denmar.*", "Denmark", salary_data2$country, ignore.case = TRUE)
#restore 'country' to factor class again
salary_data2$country <- as.factor(salary_data2$country)
Observe the resulting factor levels (most frequent only).
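The frequency table below was presumably generated with a call of this form (mirroring the pattern used for currency later on):
salary_data2 %>%
  group_by(country) %>%
  count(country) %>%
  arrange(desc(n))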
## # A tibble: 176 × 2
## # Groups: country [176]
## country n
## <fct> <int>
## 1 USA 23401
## 2 UK 1714
## 3 Canada 1668
## 4 Germany 195
## 5 New Zealand 129
## 6 Netherlands 88
## 7 France 63
## 8 Spain 43
## 9 Sweden 41
## 10 Switzerland 38
## # ℹ 166 more rows
region
Due to the large number of levels still remaining for the country variable, we establish a region variable by collapsing countries together. This may be more convenient for analysis.
library(countrycode)
#establish a new vector for countries
countries <- salary_data2$country
#convert countries to their associated world region, using, 'countrycode' package. Store in new vector called region
region <- as.factor(countrycode(countries, "country.name", "region23"))
#collapse the American (Caribbean, Central and South America) and African sub-regions, respectively, as there are not enough entries in the separate groups
region <- gsub("Caribbean|Central America|South America", "Central America, South America & Caribbean", region, ignore.case = TRUE)
region <- gsub(".*Africa.*", "Africa", region, ignore.case = TRUE)
#Join the resulting region vector to our data frame
salary_data2 <- salary_data2 %>%
mutate(region = as.factor(region)) %>%
relocate(region, .after = country)
Observe the resulting contingency table for region.
(Note: 80 remaining entries appear as NA values, as their country could not be grouped due to messy formatting.)
## # A tibble: 13 × 2
## # Groups: region [13]
## region n
## <fct> <int>
## 1 Northern America 25071
## 2 Northern Europe 1819
## 3 Western Europe 426
## 4 Australia and New Zealand 129
## 5 Southern Europe 80
## 6 <NA> 80
## 7 South-Eastern Asia 51
## 8 Eastern Asia 49
## 9 Central America, South America & Caribbean 42
## 10 Africa 37
## 11 Southern Asia 30
## 12 Eastern Europe 25
## 13 Western Asia 22
us_state Variable
Convert to factor then check for missing values (NAs or white space).
#confirm there are no NAs or blanks
salary_data2$us_state <- as.factor(salary_data2$us_state)
nas <- sum(is.na(salary_data2$us_state))
blanks <- sum(grepl("^\\s*$" , salary_data2$us_state))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 4985
A lot of blanks present, presumably due to many people not living in USA.
Observe the number of levels for us_state.
## [1] 133
133 different levels present, due to the addition of state combinations produced from people selecting two or more states.
city Variable
Convert to factor then check for missing values (NAs or white space).
#confirm there are no NAs or blanks
salary_data2$city <- as.factor(salary_data2$city)
nas <- sum(is.na(salary_data2$city))
blanks <- sum(grepl("^\\s*$" , salary_data2$city))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 16 blanks: 8
A mixture of NAs and blanks present. Observe the number of levels for
city.
## [1] 4816
A huge number of levels exists for city, as it is a free-text entry. Too much formatting would be required to clean this variable up to a usable state; if needed, we can instead target specific entries within this variable.
work_exp_overall and work_exp_field Variables
Convert to ordered factors, then check for missing values (NAs or white space).
salary_data2$work_exp_overall <- factor(salary_data2$work_exp_overall, ordered = TRUE, levels = c("1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years", "21 - 30 years", "31 - 40 years", "41 years or more" ))
salary_data2$work_exp_field <- factor(salary_data2$work_exp_field, ordered = TRUE, levels = c("1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years", "21 - 30 years", "31 - 40 years", "41 years or more" ))
nas <- sum(is.na(salary_data2$work_exp_overall))
blanks <- sum(grepl("^\\s*$" , salary_data2$work_exp_overall))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 0
nas <- sum(is.na(salary_data2$work_exp_field))
blanks <- sum(grepl("^\\s*$" , salary_data2$work_exp_field))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 0
No missing values present for either factor.
education and gender Variables
Education
Convert education to factor, then check whether there are any missing values (NAs or white space).
salary_data2$education <- as.factor(salary_data2$education)
nas <- sum(is.na(salary_data2$education))
blanks <- sum(grepl("^\\s*$" , salary_data2$education))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 214
214 blank values. Assume these are people with education levels below that of our lowest option, ‘High school’. Convert these values to ‘None’ instead.
#convert blanks to "None"
salary_data2 <- salary_data2 %>%
mutate(education = if_else(education == "", "None", as.factor(education)))
#finally, convert to ordered factor
salary_data2$education <- factor(salary_data2$education, ordered = TRUE, levels = c("None", "High School", "Some college", "College degree", "Master's degree", "PhD", "Professional degree (MD, JD, etc.)"))
Observe the contingency table.
## # A tibble: 7 × 2
## # Groups: education [7]
## education n
## <ord> <int>
## 1 College degree 13429
## 2 Master's degree 8808
## 3 Some college 2043
## 4 PhD 1419
## 5 Professional degree (MD, JD, etc.) 1317
## 6 High School 631
## 7 None 214
Gender
Convert gender to factor then observe whether there are any missing values (NAs or white space)
#confirm there are no NAs or blanks
salary_data2$gender <- as.factor(salary_data2$gender)
nas <- sum(is.na(salary_data2$gender))
blanks <- sum(grepl("^\\s*$" , salary_data2$gender))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 168
168 blank values. Collapse this category into the existing category, ‘Other or prefer not to answer’
salary_data2 <- salary_data2 %>%
mutate(gender = if_else(gender %in% c("", "Prefer not to answer"), "Other or prefer not to answer", as.character(gender)))
#finally, convert to factor
desired_order <- c("Woman", "Man", "Non-binary", "Other or prefer not to answer")
salary_data2$gender <- factor(salary_data2$gender, levels = desired_order)
Observe the contingency table.
## # A tibble: 4 × 2
## # Groups: gender [4]
## gender n
## <fct> <int>
## 1 Woman 21235
## 2 Man 5426
## 3 Non-binary 738
## 4 Other or prefer not to answer 462
race Variable
Convert race to factor, then check whether there are any missing values (NAs or white space).
salary_data2$race <- as.factor(salary_data2$race)
nas <- sum(is.na(salary_data2$race))
blanks <- sum(grepl("^\\s*$" , salary_data2$race))
cat("NAs:", nas, " ", "blanks:", blanks)
## NAs: 0 blanks: 171
171 blank values. Combine them with the category: ‘Another option not listed here or prefer not to answer’
salary_data2 <- salary_data2 %>%
mutate(race = if_else(race == "", "Another option not listed here or prefer not to answer", as.character(race)))
salary_data2$race <- as.factor(salary_data2$race)
50 different levels for race are present, due to the multitude of combinations. Assign a new variable, race_main, which collapses the less frequent levels into an 'Other' category while keeping the 13 most frequent.
#create new vector
race_levels = 13
race_main <- fct_lump_n(salary_data2$race, race_levels)
#join with data frame
salary_data2 <- salary_data2 %>%
mutate(race_main = race_main, .after = race)
Observe the contingency table.
## # A tibble: 14 × 2
## # Groups: race_main [14]
## race_main n
## <fct> <int>
## 1 White 23071
## 2 Asian or Asian American 1384
## 3 Another option not listed here or prefer not to answer 790
## 4 Black or African American 672
## 5 Hispanic, Latino, or Spanish origin 596
## 6 Hispanic, Latino, or Spanish origin, White 382
## 7 Asian or Asian American, White 343
## 8 Other 174
## 9 Black or African American, White 123
## 10 Middle Eastern or Northern African, White 82
## 11 Middle Eastern or Northern African 69
## 12 Native American or Alaska Native, White 67
## 13 White, Another option not listed here or prefer not to answer 65
## 14 Native American or Alaska Native 43
salary Variable
Search for blank values (consisting of zero or more white space characters) present in salary.
## [1] 0
Remove “,” characters from salary values and convert
variable to numeric type.
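The conversion code is not shown in the original; a minimal sketch:
#strip thousands separators and convert salary to numeric
salary_data2$salary <- as.numeric(gsub(",", "", salary_data2$salary))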
No NAs present in numeric form of salary.
## [1] 0
currency Variable
Convert to factor, then observe the distribution.
salary_data2 <- salary_data2 %>%
mutate(currency = as.factor(currency))
salary_data2 %>%
group_by(currency) %>%
count(currency) %>%
arrange(desc(n))
## # A tibble: 11 × 2
## # Groups: currency [11]
## currency n
## <fct> <int>
## 1 USD 23205
## 2 CAD 1661
## 3 GBP 1584
## 4 EUR 638
## 5 AUD/NZD 499
## 6 Other 157
## 7 CHF 37
## 8 SEK 37
## 9 JPY 23
## 10 ZAR 16
## 11 HKD 4
A wide range of currencies present.
To proceed with the analysis, we need to standardize
salary by converting all values to USD.
salary_USD
Use the relevant exchange rate from the specific date of each respective survey entry.
Historical exchange rate data was sourced from Reserve Bank of New Zealand website, and uploaded as a CSV file.
The data dates back to January 2018, which covers the duration of this survey.
After importing the data, some initial formatting takes place.
exchange_data <- read.csv("/Users/jameslauder/Desktop/Data Learning/Personal Projects/Salary Survey Analysis/Exchange rates.csv")
exchange_data2 <- exchange_data %>%
#Rename variables
setNames(c("date", "TWI", "USD", "GBP", "AUS", "JPY", "EUR", "CAD", "KRW", "CNY", "MYR", "HKD", "IDR", "THB", "SGD", "TWD", "INR", "PHP", "VND", "his_TWI")) %>%
#format date column
mutate(date = as.Date(date, format = "%d %b %Y")) %>%
#remove commas from the IDR through VND columns (columns 13 to 19)
mutate_at(vars(13:19), ~gsub(",", "", .)) %>%
#format currency conversion columns
mutate_at(vars(2:20), as.numeric) %>%
#Trim first four redundant rows off
slice(5:n())
Join the new exchange_data2 data frame with the salary_data2 data frame, using a left join with date as the key variable.
# Perform a Left join, by date, to align the relevant daily exchange rates with each survey entry
salary_data2_join <- salary_data2 %>%
left_join(exchange_data2, by = "date")
Note that there are multiple survey entries with no daily exchange rate data attached, due to there being no data for those specific days in our exchange rate data frame.
Remedy this by using the exchange rate data for the next available date.
#set initial parameters
#index for 1st variable to fill
start = which(colnames(salary_data2_join) == "TWI")
#index for last variable to fill
end = which(colnames(salary_data2_join) == "his_TWI")
#vector of the variables to fill
cols <- colnames(salary_data2_join)[start:end]
#vector of indices of all the NA values (note: all other variables have an identical NA pattern to 'USD')
NAs <- which(is.na(salary_data2_join$USD))
#vector of indices of non-NA values
not_NAs <- which(!is.na(salary_data2_join$USD))
#Run a for loop which does the following:
#for each concerned variable
for(i in 1:length(cols)){
colname = cols[i]
#for every observation with an NA
for(j in NAs){
#assign the index of the next row with a non-missing value, to use as a replacement
next_value = min(not_NAs[not_NAs > j])
#Perform the replacement
salary_data2_join[[colname]][j] = salary_data2_join[[colname]][next_value]
}
}
There are a few relevant currencies for which the exchange_data2 data frame did not have data.
Bind additional columns on for those currencies, using a cruder method: an exchange rate averaged over a relevant period.
salary_data2_join <- salary_data2_join %>%
mutate(NZD = 1, .after = VND) %>%
mutate(CHF = 0.6465, .after = VND) %>%
mutate(ZAR = 10.4566, .after = VND) %>%
mutate(SEK = 6.04, .after = VND)
Create a new variable called salary_USD with all salary
values converted from native currency to USD.
Perform the salary conversions using a for loop and if/else statements.
Note that the exchange rate data only includes rates between the various currencies and NZD; therefore, first convert all currencies to NZD, then convert to USD.
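For instance, with purely illustrative (hypothetical) rates of 0.52 GBP per NZD and 0.70 USD per NZD, a 25,000 GBP salary would convert as follows:
#illustrative example only: the rates below are hypothetical, not taken from the exchange rate data
gbp_salary <- 25000
gbp_per_nzd <- 0.52   #hypothetical GBP per NZD rate
usd_per_nzd <- 0.70   #hypothetical USD per NZD rate
(gbp_salary / gbp_per_nzd) * usd_per_nzd   #approximately 33654 USD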
#Create empty vector to be populated with converted USD salary values
salary_USD = numeric(nrow(salary_data2_join))
#for every observation
for(i in 1:nrow(salary_data2_join)){
#if currency value is not "other"
if(salary_data2_join[i,"currency"] != "Other"){
#extract the type of currency
currency_value <- as.character(salary_data2_join[i,"currency"])
#Convert any values of "AUD/NZD" to "NZD" so that we can then access that specific column's exchange rate (note we do not actually have any Australian entries, interestingly)
if(currency_value == "AUD/NZD"){
currency_value = "NZD"
}
#reference the applicable exchange rate by using the 'currency' value to select the appropriate column
exchange_value <- salary_data2_join[[currency_value]][i]
#extract the salary value for this observation
salary <- salary_data2_join[i,"salary"]
#perform calculation to convert native salary to USD
salary_USD[i] <- (salary / exchange_value) * salary_data2_join[i,"USD"]
#otherwise, if "Other", #set value to NA for now
}else{
salary_USD[i] = NA
}
}
#round the resulting variable and bind to initial data frame
salary_data2 <- salary_data2 %>%
mutate(salary_USD = round(salary_USD)) %>%
relocate(salary_USD, .after = salary)
salary_USD: Initial Observations and Outlier Treatment
Observe a summary of salary_USD quantiles.
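The summary below can be produced with:
summary(salary_data2$salary_USD)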
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 54015 75000 92117 107000 102000000 157
Boxplot assessment.
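The boxplot call is omitted from the report output; a sketch:
boxplot(salary_data2$salary_USD, main = "Distribution of salary_USD")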
One extremely large value affecting the visualization.
Observe the largest values to gain some more perspective.
#Observe the top 20 values of the data frame
salary_data2 %>%
select(salary_USD, currency, country, job_title) %>%
mutate(job_title = substr(job_title, 1, 40)) %>%
arrange(desc(salary_USD)) %>%
slice_head(n = 20)
## salary_USD currency country job_title
## 1 102000000 USD Colombia Operations Manager
## 2 5000044 USD USA Inside sales manager
## 3 3000000 USD USA Owner and CEO
## 4 2111538 USD Singapore Product Manager
## 5 1900000 USD USA Attending Physician (general internal me
## 6 1650000 USD USA Principal Software Engineer
## 7 1334782 USD USA Senior Policy Advisor
## 8 1318422 GBP UK Consultant
## 9 1300000 USD USA Partner
## 10 1263782 GBP UK Agricultural Supply line Negotiating con
## 11 1260000 USD USA Senior Consultant
## 12 1250000 USD USA Marketing Manager
## 13 1200000 USD USA Partner
## 14 1126141 CHF India Sr. Research associate
## 15 1100000 USD USA Software Engineer
## 16 1100000 USD USA Partner
## 17 1026691 EUR Germany Lead developer
## 18 956203 GBP Norway Purchaser
## 19 954000 USD USA Business Systems Analyst III
## 20 950000 USD USA Director, Software Development
Very large values are present. Assume there is a mistake with the currency value for the unrealistic top salary of $102,000,000: the respondent (from Colombia) may have mistakenly selected USD, meaning the currency conversion was never applied. Manually convert from Colombian Peso to USD using the relevant exchange rate. The resulting value of US$25,500 is much more realistic.
salary_data2 <- salary_data2 %>%
mutate(salary_USD = ifelse(salary_USD == 102000000, salary_USD * 0.00025, salary_USD))
While addressing unrealistic values, perform a quick initial check of the lowest values.
salary_data2 %>%
select(salary_USD, currency, country, job_title) %>%
mutate(job_title = substr(job_title, 1, 40)) %>%
filter(salary_USD <= 1000) %>%
arrange(salary_USD)
## salary_USD currency country job_title
## 1 0 USD USA "mum" ;)
## 2 0 USD USA Executive Director
## 3 0 USD USA Attorney
## 4 0 USD USA Student teacher
## 5 0 USD USA Product Marketer
## 6 0 USD USA Househusband
## 7 0 USD USA Founder
## 8 0 USD USA Unemployed
## 9 0 USD USA Government Relations Director
## 10 0 USD USA Realtor
## 11 0 USD na na
## 12 0 USD USA College Senior
## 13 0 USD USA Student
## 14 0 USD USA Homemaker
## 15 0 USD USA data science student
## 16 0 USD USA N/A
## 17 1 USD USA Software Development Lead
## 18 3 ZAR spain chief executive official
## 19 4 USD ss sdsd
## 20 15 USD dbfemf mn jj
## 21 22 EUR Netherlands Programmer
## 22 35 USD USA Operations Manager
## 23 36 EUR spain Auditor
## 24 36 USD USA Marketing Coordinator
## 25 38 USD USA Special Education Teacher
## 26 39 GBP UK Media executive
## 27 40 EUR Germany Project Manager / Account Manager
## 28 40 USD USA Assistant Registrar
## 29 40 USD USA HR Administrator/Generalist
## 30 40 USD USA Community Project Manager
## 31 40 USD USA Territory Manager
## 32 42 EUR finland occupational therapist
## 33 42 CAD Canada Technician
## 34 43 EUR Netherlands Compliance & Risk Officer
## 35 43 EUR UK Games Specialist
## 36 44 USD USA Scribe
## 37 45 EUR France FP&A
## 38 45 USD USA Finance Assistant
## 39 48 EUR Italy Cloud Consultant
## 40 50 USD USA Content Specialist
## 41 52 USD USA Digital Media Coordinator
## 42 53 USD USA Chief Fiscal Officer
## 43 54 USD USA Art Director
## 44 55 USD USA Personal Executive Assistant
## 45 55 USD USA Account Manager
## 46 55 USD Switzerland VM BACKOFFICE COORDINATOR
## 47 55 USD USA Talent Acquisition Partner
## 48 55 USD USA Customer service manager
## 49 55 USD USA Web Customer Service Rep.
## 50 55 GBP UK accreditation manager
## 51 56 CAD Canada Analyst
## 52 56 USD USA homie
## 53 57 USD USA Program director
## 54 57 USD USA Content Specialist
## 55 58 USD USA Quality Assurance Lead
## 56 60 USD USA Store Manager
## 57 60 EUR Germany Academic Librarian
## 58 61 USD USA Managing Editor
## 59 62 EUR spain project manager
## 60 65 USD USA Teacher
## 61 65 USD USA supply Planner
## 62 66 EUR Germany Software Developer
## 63 69 GBP UK Head of Business Support
## 64 70 USD USA Regional Manager
## 65 70 USD USA Infection Prevention Associate
## 66 70 USD America Freelance Art Director
## 67 71 GBP UK Security Analyst
## 68 72 USD USA Technical Writer
## 69 72 EUR Netherlands Frontend Engineer
## 70 73 CAD Canada Director of Prevention and Early Interve
## 71 74 CAD Canada Associate
## 72 76 EUR Germany Senior Developer / Consultant
## 73 76 USD USA vv
## 74 78 USD USA MSW
## 75 79 EUR Germany Data Manager
## 76 80 USD USA Facilitator
## 77 80 USD USA Production editor
## 78 80 USD USA Job title
## 79 90 GBP UK Senior Softwaare Engineer (Web)
## 80 90 USD USA RN
## 81 95 CAD Canada Change Management Consultant
## 82 97 EUR Germany Strategic Buyer
## 83 100 USD USA Regional
## 84 105 USD USA Senior Marketing Manager
## 85 105 USD USA Registered Nurse
## 86 108 USD USA Engineering Supervisor
## 87 110 USD USA Health Administrator
## 88 121 EUR Germany CTO
## 89 121 EUR Germany Senior Support Engineer
## 90 130 USD USA Chief Data Scientist
## 91 130 USD USA Coach
## 92 145 USD USA Software Engineer
## 93 150 USD USA Senior Data Scientist
## 94 155 USD USA Manager
## 95 160 USD Canada General manager
## 96 180 USD USA Director of Operations
## 97 185 USD USA Senior Director, ESG and Investor Relati
## 98 240 USD USA Chief Data Scientist
## 99 253 JPY Japan Travel designer
## 100 350 USD USA Counsel
## 101 456 USD Nigeria Ibterb
## 102 500 USD Ghana Data scientist
## 103 880 USD India Online Tutor
## 104 1000 USD USA Teaching practicum supervisor
There are 104 salary values of $1000 or less, with a number of them being values of $0.
It is reasonable to assume that these respondents were using shorthand and reporting salary in terms of $1000s, which is sometimes the norm: i.e. $40 really means $40k, or $40,000. The country and job title values for these respondents do not contradict this theory.
Transform all of the values below $240 (to be somewhat conservative) by multiplying them by a factor of 1000.
salary_data2 <- salary_data2 %>%
mutate(salary_USD = ifelse(salary_USD < 240, salary_USD * 1000, salary_USD))
Now observe the smallest values again, this time those salary values at or below $20,000.
salary_data2 %>%
select(salary_USD, currency, country, job_title) %>%
mutate(job_title = substr(job_title, 1, 30)) %>%
filter(salary_USD <= 20000) %>%
arrange(salary_USD)
## salary_USD currency country job_title
## 1 0 USD USA "mum" ;)
## 2 0 USD USA Executive Director
## 3 0 USD USA Attorney
## 4 0 USD USA Student teacher
## 5 0 USD USA Product Marketer
## 6 0 USD USA Househusband
## 7 0 USD USA Founder
## 8 0 USD USA Unemployed
## 9 0 USD USA Government Relations Director
## 10 0 USD USA Realtor
## 11 0 USD na na
## 12 0 USD USA College Senior
## 13 0 USD USA Student
## 14 0 USD USA Homemaker
## 15 0 USD USA data science student
## 16 0 USD USA N/A
## 17 240 USD USA Chief Data Scientist
## 18 253 JPY Japan Travel designer
## 19 350 USD USA Counsel
## 20 456 USD Nigeria Ibterb
## 21 500 USD Ghana Data scientist
## 22 880 USD India Online Tutor
## 23 1000 USD USA Software Development Lead
## 24 1000 USD USA Teaching practicum supervisor
## 25 1442 USD USA Associate Team Lead - Revenue
## 26 1727 AUD/NZD USA Private Tutor
## 27 2072 ZAR South Africa Manager
## 28 2072 ZAR South Africa Manager
## 29 2126 EUR Netherlands Marketing
## 30 2350 USD USA Graduate Research Assistant -
## 31 2400 USD Pakistan Quality Assurance Auditor
## 32 2897 EUR Belgium Customer Service/Tech Support
## 33 3000 ZAR spain chief executive official
## 34 3055 GBP UK Library customer advisor
## 35 3200 USD USA Registration and enrollment of
## 36 3348 SEK USA Paraprofessional
## 37 3592 EUR USA Soldier
## 38 4000 USD USA freelance captioner
## 39 4000 USD USA Property Manager
## 40 4000 USD ss sdsd
## 41 4100 USD USA patient services associate
## 42 4800 USD USA Sales Development Representati
## 43 5100 USD USA Medical Biller
## 44 5100 USD USA Assistant Director of Marketin
## 45 5265 USD Malaysia QC
## 46 5700 USD USA Accounts Payable Specialist
## 47 6000 USD USA server
## 48 6111 GBP UK Clinical Evaluation Manager
## 49 6209 USD USA Senior Research Analyst
## 50 6400 USD USA Chief of Staff
## 51 6555 EUR Germany Narrative Designer
## 52 6600 USD India Analyst
## 53 7250 USD Pakistan iOS Developer
## 54 7390 USD Nigeria Senior Analyst, Management Con
## 55 7847 EUR Bosnia and Herzegovina Marketing Automation Analyst
## 56 8000 USD USA Development Officer
## 57 9000 USD USA Online Editor
## 58 9160 GBP UK freelance illustrator
## 59 9600 USD Cuba Frontend Engineer
## 60 9658 EUR Spain Translator
## 61 10000 USD USA Sr Consultant
## 62 10000 USD USA Student
## 63 10382 EUR Romania Technology Associate
## 64 10500 USD USA Systems Engineer
## 65 10616 JPY Japan Compliance / Privacy Manager
## 66 10700 USD USA Software Engineer Technical Su
## 67 11000 USD USA software engineer
## 68 11000 USD Pakistan Technical Writer
## 69 11100 USD USA Resident's Assistant
## 70 11500 USD USA Assistant Professor
## 71 11800 USD USA Producer
## 72 12000 USD USA Bookkkeeper
## 73 12000 USD USA International Board Certified
## 74 12000 USD USA Proposal manager
## 75 12000 USD USA Small Business Administrator
## 76 12000 USD USA Engineer
## 77 12000 USD India Analyst
## 78 12000 USD USA Assistant Editor
## 79 12000 USD USA Music Librarian
## 80 12000 USD UK Senior illustrator
## 81 12000 USD USA Paraprofessional
## 82 12000 USD USA Associate Attorney
## 83 12000 USD USA TA
## 84 12093 CAD Canada Administrative assistant (part
## 85 12559 GBP UK Chartered accountant
## 86 12600 USD USA Substitute teacher
## 87 13000 USD USA Restaurant manager
## 88 13000 USD USA Manager of Customer Support
## 89 13000 USD USA Host/Server Assistant
## 90 13000 USD USA Server
## 91 13000 USD USA Associate
## 92 13000 USD USA Retired
## 93 13000 USD USA Transactional (Contracts) Atto
## 94 13045 EUR Greece Software Engineer
## 95 13300 USD USA Executive Director
## 96 13400 USD USA Graduate Assistant
## 97 13500 USD USA Growth Marketing Manager
## 98 13500 USD USA Freelance Consutant
## 99 13520 USD USA specialist
## 100 13560 USD Kuwait Instructional coach
## 101 13600 USD USA Server
## 102 13763 EUR Germany Apprentice Mechanic for Constr
## 103 13888 GBP UK Early Years Practitioner
## 104 14000 USD USA Special education teaching ass
## 105 14000 USD USA Author
## 106 14000 USD USA waitress
## 107 14000 USD USA Talent Development Manager
## 108 14453 ZAR South Africa Sysadmin
## 109 14500 USD USA Senior Tech
## 110 14503 CAD Canada Freelance artist/illustrator
## 111 14512 CAD Canada Research Fellow
## 112 14559 EUR Latvia hobby teacher
## 113 14572 GBP UK Administrator
## 114 14850 USD USA Adjunct Instructor
## 115 14920 ZAR South Africa Editor
## 116 15000 USD USA writer
## 117 15000 USD USA Graduate Student
## 118 15000 USD USA Assistant
## 119 15000 USD USA Adjunct professor and PhD stud
## 120 15000 USD USA Graduate Teaching Assistant
## 121 15000 USD USA Adjunct Instructor
## 122 15000 USD USA Graduate Student
## 123 15000 USD USA Assistant Director
## 124 15000 USD USA Graduate Assistant for Athleti
## 125 15000 USD dbfemf mn jj
## 126 15000 USD Colombia Profesional I
## 127 15080 USD USA Cashier
## 128 15080 USD USA Line Cook
## 129 15100 USD USA Operations Manager
## 130 15276 GBP UK Cleaner
## 131 15400 USD USA Americorps Member
## 132 15461 EUR Portugal L2 Cyber Security Analyst
## 133 15600 USD USA Studio/Touring Musician
## 134 15758 EUR Germany Dancer
## 135 16000 USD USA Student Worker
## 136 16000 USD USA Director
## 137 16000 USD USA Education and Outreach Coordin
## 138 16000 USD USA Writer
## 139 16000 USD USA Administrative Assistant
## 140 16000 USD USA CFO
## 141 16115 CAD Canada Photographer
## 142 16200 USD USA Senior Software Engineer
## 143 16200 USD Chile Mathematics Teacher
## 144 16500 USD USA Vice President, Analytics
## 145 16578 ZAR South Africa System administration and proj
## 146 16640 USD USA Recruiting Assistant
## 147 16800 USD USA Literacy Tutor
## 148 16902 EUR Greece Administrative coordinator
## 149 16970 EUR Spain Researcher
## 150 17000 USD USA Research assistant
## 151 17000 USD USA Undergraduate Research Assista
## 152 17000 USD USA Secretary
## 153 17200 USD USA Social Studies and Technology
## 154 17300 USD USA Project Coordinator/Communicat
## 155 17360 GBP UK Director
## 156 17384 EUR Spain Frontend programmer
## 157 17472 USD USA Assistant Manager
## 158 17500 USD USA Graduate Research Assistant
## 159 17505 EUR Spain Administrative assistant
## 160 17680 USD USA Special Events Assistant
## 161 17747 EUR Lithuania Assistant
## 162 17843 EUR France High Ropes Instructor
## 163 18000 USD USA Sales Associate
## 164 18000 USD USA Circulation Supervisor
## 165 18000 USD USA Graduate Student
## 166 18000 USD USA cartoonist
## 167 18000 USD USA Software Architect
## 168 18000 USD USA PhD student/teaching assistant
## 169 18000 USD USA Office Associate
## 170 18000 USD USA Graduate Research Assitant
## 171 18000 USD USA Senior Director, Sales
## 172 18000 USD USA Substitute Teacher
## 173 18000 USD USA Bus Driver
## 174 18021 AUD/NZD USA Library Assistant
## 175 18034 EUR France Field technician in archaeolog
## 176 18054 GBP UK Pupil Support Assistant
## 177 18109 EUR UK Marketing assistant
## 178 18500 USD USA Graduate Student
## 179 18500 USD USA Consumer directed care attenda
## 180 18532 CAD Canada Optical assistant
## 181 18543 CAD Canada Union clerk
## 182 18566 EUR Italy PhD student
## 183 18720 USD USA Cashier/back-end food prep
## 184 18720 USD USA Cashier/Clerk
## 185 18720 USD USA Page/Shelver
## 186 18720 USD USA Stocker
## 187 18720 USD USA Student
## 188 19000 USD USA Teacher Assistant
## 189 19000 USD USA Arts Education Director
## 190 19000 USD USA Special Education Paraprofessi
## 191 19065 ZAR South Africa Data Analyst
## 192 19108 CAD Canada Freelance Writer/Editor
## 193 19200 USD USA Admin Assistant
## 194 19276 EUR France Java Software Developer
## 195 19395 EUR Italy Post Doc
## 196 19500 USD USA Museum Director
## 197 19502 EUR Spain marketing designer
## 198 19583 USD USA Executive Assistant
## 199 19760 USD USA Order Processor
## 200 19885 USD USA Sales Associate
## 201 19978 USD USA Reference Librarian
## 202 19998 GBP UK Pharmacy Technician
## 203 20000 USD USA Graduate Student
## 204 20000 USD USA Part-time Faculty
## 205 20000 USD USA Cataloguer
## 206 20000 USD USA Paraprofessional
## 207 20000 USD USA Dishwasher
## 208 20000 USD USA Cashier
## 209 20000 USD USA Social media producer
## 210 20000 USD USA Tutor
## 211 20000 USD USA Unit secretary
## 212 20000 USD USA Adjunct
## 213 20000 USD Zimbabwe manager
We still notice many unexpectedly small values (as well as those $0 values). Some may be due to the weaker economies of the respective countries; however, some appear to be human error.
Based on the minimum hourly wage of multiple countries, and some simple calculations, I decided to remove observations from a series of developed countries that fall below a conservative salary of US$23,000.
We subsequently also remove a small set of salaries that fall below $11,000 (originally in JPY, SEK, or ZAR), as these values do not seem realistic either.
salary_data2 <- salary_data2 %>%
filter(salary_USD > 23000 | (salary_USD < 23000 & !(currency %in% c("USD", "GBP", "EUR", "AUD/NZD", "CAD"))))
salary_data2 <- salary_data2 %>%
filter(salary_USD > 11000)
Rechecking salary_USD distribution after initial outlier treatment
Re-check the distribution by means of boxplot and histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14453 55000 75000 89394 108000 5000044
boxplot(salary_data2$salary_USD, main = "Distribution of salary_USD after initial outlier removal")
hist(salary_data2$salary_USD, xlab = "salary_USD", main = "Distribution of salary_USD after initial outlier removal", breaks = 500)
Extreme positive skew is still present.
Observe a histogram (zoomed into the lower section of the y axis) to see the potential outliers still present.
breaks <- seq(min(salary_data2$salary_USD, na.rm = TRUE), max(salary_data2$salary_USD, na.rm = TRUE), length.out = 100)
ggplot(salary_data2, aes(x = salary_USD)) +
geom_histogram(breaks = breaks, fill = "#007bc2", col = "white") +
coord_cartesian(ylim = c(0, 15)) +
labs(title = "Distribution of salary_USD (zoomed into the lower section of y axis)")
Histograms of the data, split into two parts, aid visualization of the low-frequency positive outliers.
#set new parameter to display graphs side by side
par(mfrow = c(1, 2))
#histogram of trimmed salary data (up to $300,000)
data = salary_data2$salary_USD
breaks = seq(min(data), 300000, length.out = 80)
subset = (data[data <= max(breaks)])
hist(subset, breaks = breaks, col = "#007bc2", xlab = "Salary (USD)", main = "Salary Distribution (from $23,000 to $300,000)", cex.main = 0.6)
#histogram of trimmed salary data (from $300,000 to max)
breaks = seq(300000, max(data), length.out = 80)
subset = (data[data >= min(breaks)])
hist(subset, breaks = breaks, col = "#007bc2", xlab = "Salary (USD)", main = "Salary Distribution (from $300,000 to $5,000,044)", cex.main = 0.6)
Index plot of the data ordered by salary_USD.
#first sort data by increasing salary_USD
salary_data2 %>%
arrange(salary_USD) %>%
ggplot(aes(x = row_number(salary_USD), y = salary_USD)) +
geom_point() +
labs(x = "Ordered Observations", y = "Salary (USD)")
There are five positive outliers, starting at approximately $1,600,000.
salary_USD is severely positively skewed, with what
looks to be an exponentially increasing trend.
We can use a couple of different methods to gain more understanding of the extremity of the outliers.
Z-Score Method for Outlier Detection
Z-score method of identifying outliers: values further than 3 standard deviations from the mean.
#Convert values to z-scores and identify those with absolute value > 3
z_scores <- scale(salary_data2$salary_USD)
outliers_z_logical <- abs(z_scores) > 3
#Create a data-frame with the z-scores and associated USD salary values that are considered outliers
z_scores_outliers <- z_scores[outliers_z_logical]
salary_USD_outliers <- salary_data2$salary_USD[which(outliers_z_logical)]
outliers_z_df <- data.frame(salary_USD_outliers, z_scores_outliers)
#view dataframe (too long to print)
#outliers_z_df %>%
# arrange(desc(salary_USD_outliers))
#identify the number of outliers
paste("number of outliers: ", length(z_scores_outliers))
## [1] "number of outliers: 176"
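The lowest flagged value reported below could be obtained with, for example:
paste("Lowest value:", min(salary_USD_outliers))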
## [1] "Lowest value: 302543"
According to this method, there are 176 outliers, all positive, some of which are massively beyond 3 standard deviations from the mean. Values greater than or equal to $302,543 are considered outliers.
IQR Method for Outlier Detection
Another approach is the IQR method of identifying outliers: points further than 1.5 * IQR below the first quartile or above the third quartile are considered outliers.
#obtain quantiles and IQR
Q1 <- quantile(salary_data2$salary_USD, 0.25)
Q3 <- quantile(salary_data2$salary_USD, 0.75)
IQR_value <- Q3 - Q1
#identify outliers using those values
outliers_iqr_logical <- (salary_data2$salary_USD < (Q1 - 1.5 * IQR_value)) | (salary_data2$salary_USD > (Q3 + 1.5 * IQR_value))
iqr_outliers <- salary_data2$salary_USD[which(outliers_iqr_logical)]
#View outliers, in ascending order (too large to print)
#sort(iqr_outliers, decreasing = FALSE)
#identify the number of outliers
paste("number of outliers: ", length(iqr_outliers))
## [1] "number of outliers: 1123"
## [1] "Lowest value: 187798"
There are 1123 resulting outliers, all positive, using this method.
Values greater than or equal to $187,798 are considered outliers, making this a more aggressive approach than the Z-score method used previously.
Comparing Sample Data with Population Data
There are a lot of observations with extremely high salary values. The potential for bias is high, as this is an anonymous self-selected survey over the internet. There are multiple salary values over $1,000,000 and many more residing close to this value.
After doing some online research, we can compare the proportion of high earners in this survey with the proportion of high earners in the general population, to get a general gauge of how likely we are to observe data this extreme, and then make a judgement about how realistic these values may be.
In the following table, the 'population' column represents an estimate, based on online research, of the proportion of the population with a salary equal to or greater than the assigned value, and the 'sample' column represents the corresponding proportion within this survey.
#create a function that takes threshold values and data frame, and returns proportion of data at least that large
proportion_func <- function(threshold, data){
data %>%
filter(salary_USD >= threshold) %>%
summarize(proportion = signif(n() / nrow(data), 2))
}
#establish thresholds and calculate proportions
thresholds <- c(300000, 500000, 700000, 1000000, 1250000, 1500000, 2000000, 2500000, 3000000, 4000000)
proportions = sapply(thresholds, proportion_func, data = salary_data2)
names(proportions) = thresholds
library(knitr)
# Assign table columns (vectors) then collate together
thresh_values <- c("$300k", "$500k", "$700k", "$1m", "$1.25m", "$1.5m", "$2m", "$2.5m", "$3m", "$4m")
pop_values <- c(0.01, 0.002, 0.001, 0.0009, 0.0008, 0.0007, 0.0005, 0.0002, 0.0001, 0.0001)
samp_values <- as.vector(unlist(proportions))
df <- data.frame(salary = thresh_values, population = pop_values, sample = samp_values)
# Set options to display numbers in absolute terms
options(scipen = 999)
# Print table
kable(df, caption = "Estimated Proportions of Population and Survey Sample over Specific Salary Thresholds")
| salary | population | sample |
|---|---|---|
| $300k | 0.0100 | 0.007400 |
| $500k | 0.0020 | 0.002300 |
| $700k | 0.0010 | 0.001300 |
| $1m | 0.0009 | 0.000580 |
| $1.25m | 0.0008 | 0.000400 |
| $1.5m | 0.0007 | 0.000180 |
| $2m | 0.0005 | 0.000110 |
| $2.5m | 0.0002 | 0.000073 |
| $3m | 0.0001 | 0.000073 |
| $4m | 0.0001 | 0.000036 |
The table suggests that the survey data seems to be within reasonable expectation.
Transformation and Further Outlier Treatment
Logarithm transformation
Income based data often features a distinct positive skew.
Due to extreme positive skew in our salary_USD variable,
we can observe the effect of a log transformation.
#log transformation of salary
salary_USD_log = log(salary_data2$salary_USD)
data = salary_USD_log
#histogram
par(mfrow = c(1,2))
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "Salary (USD)", main = "Log Transformed Salary")
#boxplot
boxplot(data, col = "skyblue", xlab = "Salary (USD)", main = "Log Transformed Salary")
A more symmetric distribution is obtained (although outliers are still present).
Combination of Logarithm Transformation and Various degrees of Outlier Trimming
Comparison of side-by-side boxplots and histograms allows us to assess the effect of varying degrees of outlier trimming.
First with the un-transformed salary_USD variable.
#Establish new data frames with differing degrees of outlier trimming
salary_data_tr <- salary_data2 %>%
filter(salary_USD < 3000000, salary > 30000)
salary_data_tr2 <- salary_data2 %>%
filter(salary_USD < 1500000, salary > 30000)
salary_data_tr3 <- salary_data2 %>%
filter(salary_USD < 1000000, salary > 30000)
salary_data_tr4 <- salary_data2 %>%
filter(salary_USD < 300000, salary > 30000)
salary_data_tr5 <- salary_data2 %>%
filter(salary_USD < 190000, salary > 30000)
#Comparative salary boxplots (original variable, with progressively more outliers trimmed)
par(mfrow = c(2,3))
boxplot(salary_data2$salary_USD, col = "lightblue", main = "No additional trimming")
boxplot(salary_data_tr$salary_USD, col = "lightblue", main = "salary < $3,000,000")
boxplot(salary_data_tr2$salary_USD, col = "lightblue", main = "salary < $1,500,000")
boxplot(salary_data_tr3$salary_USD, col = "lightblue", main = "salary < $1,000,000")
boxplot(salary_data_tr4$salary_USD, col = "lightblue", main = "salary < $300,000")
boxplot(salary_data_tr5$salary_USD, col = "lightblue", main = "salary < $190,000")
par(mfrow = c(2,3))
#Comparative salary histograms (original variable, with progressively more outliers trimmed)
data = salary_data2$salary_USD
breaks = seq(min(data), max(data), length.out = 500)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "No additional trimming")
data = salary_data_tr$salary_USD
breaks = seq(min(data), max(data), length.out = 400)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $3,000,000")
data = salary_data_tr2$salary_USD
breaks = seq(min(data), max(data), length.out = 300)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $1,500,000")
data = salary_data_tr3$salary_USD
breaks = seq(min(data), max(data), length.out = 100)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $1,000,000")
data = salary_data_tr4$salary_USD
breaks = seq(min(data), max(data), length.out = 60)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $300,000")
data = salary_data_tr5$salary_USD
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $190,000")
Then we compare with the log-transformed distribution, with varying degrees of outliers trimmed.
#log transform each trimmed salary vector
salary_USD_tr_log = log(salary_data_tr$salary_USD)
salary_USD_tr2_log = log(salary_data_tr2$salary_USD)
salary_USD_tr3_log = log(salary_data_tr3$salary_USD)
salary_USD_tr4_log = log(salary_data_tr4$salary_USD)
salary_USD_tr5_log = log(salary_data_tr5$salary_USD)
#plot boxplots
par(mfrow = c(2,3))
boxplot(salary_USD_log, col = "lightblue", main = "log(salary)")
boxplot(salary_USD_tr_log, col = "lightblue", main = "log(salary < $3,000,000)")
boxplot(salary_USD_tr2_log, col = "lightblue", main = "log(salary < $1,500,000)")
boxplot(salary_USD_tr3_log, col = "lightblue", main = "log(salary < $1,000,000)")
boxplot(salary_USD_tr4_log, col = "lightblue", main = "log(salary < $300,000)")
boxplot(salary_USD_tr5_log, col = "lightblue", main = "log(salary < $190,000)")
par(mfrow = c(2,3))
data = salary_USD_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary)")
data = salary_USD_tr_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $3,000,000)")
data = salary_USD_tr2_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $1,500,000)")
data = salary_USD_tr3_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $1,000,000)")
data = salary_USD_tr4_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $300,000)")
data = salary_USD_tr5_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $190,000)")
Using the log transformed distribution with observations over $300,000 trimmed off, we can again determine potential outliers using the previous Z-score and IQR methods:
#Z-scores method
#Convert values to z-scores and identify those with absolute value > 3
z_scores2 <- scale(salary_USD_tr4_log)
outliers_z_logical2 <- abs(z_scores2) > 3
#Create a data-frame with the z-scores and associated USD salary values that are considered outliers
z_scores_outliers2 <- z_scores2[outliers_z_logical2]
salary_USD_log_outliers2 <- salary_USD_tr4_log[which(outliers_z_logical2)]
salary_USD_outliers2 <- salary_data_tr4$salary_USD[which(outliers_z_logical2)]
outliers_z_df2 <- data.frame(salary_USD_outliers2, salary_USD_log_outliers2, z_scores_outliers2)
#view dataframe
outliers_z_df2 %>%
arrange(desc(salary_USD_outliers2))
## salary_USD_outliers2 salary_USD_log_outliers2 z_scores_outliers2
## 1 19065 9.855609 -3.123233
## 2 16578 9.715832 -3.430592
## 3 14920 9.610458 -3.662300
## 4 14453 9.578657 -3.732227
#obtain quantiles and IQR
Q1 <- quantile(salary_USD_tr4_log, 0.25)
Q3 <- quantile(salary_USD_tr4_log, 0.75)
IQR_value <- Q3 - Q1
#identify outliers using those values
outliers_iqr_logical <- (salary_USD_tr4_log < (Q1 - 1.5 * IQR_value)) | (salary_USD_tr4_log > (Q3 + 1.5 * IQR_value))
#Create a data-frame with the outliers for both log transformed and non-transformed values
iqr_outliers_log_2 <- salary_USD_tr4_log[which(outliers_iqr_logical)]
iqr_outliers_2 <- salary_data_tr4$salary_USD[which(outliers_iqr_logical)]
iqr_outliers_df2 <- data.frame(iqr_outliers_2, iqr_outliers_log_2)
#view dataframe
iqr_outliers_df2 %>%
arrange(desc(iqr_outliers_2))
## iqr_outliers_2 iqr_outliers_log_2
## 1 295000 12.594731
## 2 295000 12.594731
## 3 295000 12.594731
## 4 295000 12.594731
## 5 292000 12.584509
## 6 290000 12.577636
## 7 19065 9.855609
## 8 16578 9.715832
## 9 14920 9.610458
## 10 14453 9.578657
The boxplots and histograms for the untransformed salary_USD variable make clear that there is a very high degree of positive skew, with a high frequency of positive outliers. Each successive degree of trimming brings the distribution progressively closer to normality.
The distribution with values exceeding $300,000 trimmed off (salary_data_tr4) shows improved normality and a reduced outlier frequency, while being more conservative than the more aggressive approach of trimming values exceeding $190,000. The Z-score method of outlier detection supports this choice.
This distribution results in the following summary statistics, and density plot:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14453 56000 76000 87763 108000 295000
| count | average | std_err | median | std_dev |
|---|---|---|---|---|
| 26342 | 87763 | 266 | 76000 | 43150 |
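The chunk that generated this table is not echoed in the report; a minimal sketch that would reproduce a table like it, assuming it simply summarizes salary_USD in salary_data_tr4, is:
#Sketch only: reproduce the summary table above from the trimmed data frame
salary_data_tr4 %>%
summarize(count = n(),
average = round(mean(salary_USD)),
std_err = round(sd(salary_USD) / sqrt(n())),
median = round(median(salary_USD)),
std_dev = round(sd(salary_USD))) %>%
kable()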
The distribution appears relatively normal but with significant positive skew. There is a very large range: the smallest value, $14,453, is just under 5% of the largest, $295,000. Compared with the median salary of $76,000, the mean ($87,763) is pulled upwards by the positive skew. The IQR reflects this as well.
Income-based data often exhibits a positive skew for several reasons:
Income Inequality: Many societies have significant income inequality, with a relatively small number of individuals or households earning very high incomes, while the majority earn lower incomes. This inequality causes the income distribution to be positively skewed, with a long right tail.
Capped Lower Bound: In many cases, there is a minimum income threshold below which individuals cannot earn less. This creates a lower bound or floor effect, resulting in a concentration of observations at the lower end of the income scale, leading to positive skewness.
Diminishing Returns: As income increases, the marginal utility of additional income tends to decrease. This means that while some individuals may earn very high incomes, the rate at which their income grows may slow down, contributing to a positively skewed distribution.
Outliers: Incomes at the very high end of the scale, such as those of top executives, celebrities, or entrepreneurs, can be extreme outliers and contribute to the positive skew of the distribution.
Therefore, a log-transformation of salary_USD seems like
a very suitable measure to reduce skew and aid the assumptions required
for upcoming inferential testing.
We see that the log transformation does in fact significantly improve normality.
Again, the distribution with no values exceeding $300,000,
salary_USD_tr4_log, displays a healthy balance of outlier trimming:
data <- data.frame(salary = salary_USD_tr4_log)
ggplot(data, aes(x = salary)) +
geom_density(alpha = 0.4, linewidth = 0.1, fill = "blue", color = "blue") +
labs(title = bquote(bold("log(Salary < $300,000)")), x = "log(Salary (USD))", y = "Density") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16))
The plots indicate that we only have a few potential outliers remaining in this distribution.
Using the Z-scores method, only four observations meet the criteria for outliers (far fewer than the 134 outliers remaining under the previous iteration of outlier trimming, < $1,000,000).
Using the IQR method, the same four observations qualify as negative outliers, and there are also six positive outliers (again far fewer than the 164 outliers remaining under the previous iteration of outlier trimming, < $1,000,000).
Therefore, we may consider this distribution
(salary_USD_tr4_log) when proceeding with some of our
analysis methods that require normality and homoscedasticity.
Bar graphs displaying distributions of factor variables.
# Calculate counts for each level then create a bar graph of the counts
education_counts <- table(salary_data_tr4$education)
ggplot(data = NULL, aes(x = factor(names(education_counts),
levels = levels(salary_data_tr4$education)), y = education_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = education_counts),
vjust = ifelse(education_counts > 0.90 * max(education_counts), 1.5, -0.5), size = 3) +
theme_classic() +
labs(x = "Education", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Education"))) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7), axis.title.x = element_text(margin = margin(t = 10), hjust = 0.5)) +
annotate("rect", xmin = 5.9, xmax = 7.1, ymin = 11700, ymax = 12300,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 6.5, y = 12000, label = "Box value = count", size = 3) work_exp_overall_counts <- table(salary_data_tr4$work_exp_overall)
ggplot(data = NULL, aes(x = factor(names(work_exp_overall_counts),
levels = levels(salary_data_tr4$work_exp_overall)), y =work_exp_overall_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = work_exp_overall_counts),
vjust = ifelse(work_exp_overall_counts > 0.90 * max(work_exp_overall_counts), 1.5, -0.5), size = 3) +
theme_classic() +
labs(x = "Work Experience (Overall)", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Overall Work Experience"))) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 6.7, xmax = 8.3, ymin = 9700, ymax = 10300,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 7.5, y = 10000, label = "Box value = count", size = 3) work_exp_field_counts <- table(salary_data_tr4$work_exp_field)
ggplot(data = NULL, aes(x = factor(names(work_exp_field_counts),
levels = levels(salary_data_tr4$work_exp_field)), y = work_exp_field_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = work_exp_field_counts),
vjust = ifelse(work_exp_field_counts > 0.90 * max(work_exp_field_counts), 1.5, -0.5), size = 3) +
theme_classic() +
labs(x = "Work Experience (in field)", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Work Experience in Field"))) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 6.8, xmax = 8.2, ymin = 5800, ymax = 6200,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 7.5, y = 6000, label = "Box value = count", size = 3) age_group_counts <- table(salary_data_tr4$age_group)
ggplot(data = NULL, aes(x = factor(names(age_group_counts),
levels = levels(salary_data_tr4$age_group)), y = age_group_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = age_group_counts),
vjust = ifelse(age_group_counts > 0.90 * max(age_group_counts), 1.5, -0.5), size = 3) +
theme_classic() +
labs(x = "Age", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Age Group"))) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 5.9, xmax = 7.1, ymin = 11700, ymax = 12300,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 6.5, y = 12000, label = "Box value = count", size = 3) gender_counts <- table(salary_data_tr4$gender)
ggplot(data = NULL, aes(x = factor(names(gender_counts), levels = c("Man", "Woman", "Non-binary", "Other or prefer not to answer")), y = gender_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = gender_counts), vjust = ifelse(gender_counts > 0.40 * max(gender_counts), 1.5, -0.5), size = 3) + # Label the bars with counts
theme_classic() +
labs(x = "Gender", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Gender"))) +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 3.6, xmax = 4.4, ymin = 18500, ymax = 19500,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 4, y = 19000, label = "Box value = count", size = 3) currency_counts <- table(salary_data_tr4$currency)
ggplot(data = NULL, aes(x = names(currency_counts), y = currency_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = currency_counts), vjust = ifelse(currency_counts > 0.40 * max(currency_counts), 1.5, -0.5), size = 3) + # Label the bars with counts
theme_classic() +
labs(x = "Currency", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Currency"))) +
theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 1, xmax = 3, ymin = 20500, ymax = 21500,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 2, y = 21000, label = "Box value = count", size = 3) region_counts <- table(salary_data_tr4$region)
threshold <- 0.20 * max(region_counts)
ggplot(data = NULL, aes(x = names(region_counts), y = region_counts)) +
geom_col(fill = "#007bc2") +
geom_label(aes(label = region_counts,
y = ifelse(region_counts > threshold, region_counts - threshold * 0.8, region_counts)),
vjust = -0.5, size = 2.5, color = "black") +
theme_classic() +
labs(x = "Region", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Region"))) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(margin = margin(t = 10))) +
annotate("rect", xmin = 10, xmax = 12, ymin = 21000, ymax = 23000,
fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
annotate("text", x = 11, y = 22000, label = "Box value = count", size = 3) # Display plots side by side
#library(gridExtra)
#grid.arrange(plot_gender, plot_currency, ncol = 2, widths = c(0.8, 0.8))Contingency tables for factor variables with many levels:
industry_counts <- salary_data_tr4 %>%
group_by(industry) %>%
count() %>%
filter(n > 500) %>%
arrange(desc(n))
kable(industry_counts, caption = "Distribution of factor: industry (n > 500 only)")
| industry | n |
|---|---|
| Computing or Tech | 4477 |
| Education (Higher Education) | 2313 |
| Nonprofits | 2308 |
| Government and Public Administration | 1808 |
| Health care | 1785 |
| Engineering or Manufacturing | 1709 |
| Accounting, Banking & Finance | 1703 |
| Marketing, Advertising & PR | 1064 |
| Law | 1025 |
| Other | 861 |
| Business or Consulting | 812 |
| Education (Primary/Secondary) | 754 |
| Media & Digital | 734 |
| Insurance | 505 |
city_counts <- salary_data_tr4 %>%
group_by(city) %>%
count() %>%
filter(n > 200) %>%
arrange(desc(n))
kable(city_counts, caption = "Distribution of factor: city (n > 200 only)")
| city | n |
|---|---|
| Boston | 759 |
| Chicago | 727 |
| New York | 684 |
| Seattle | 666 |
| London | 500 |
| New York City | 492 |
| San Francisco | 478 |
| Los Angeles | 447 |
| Portland | 414 |
| Toronto | 407 |
| Austin | 336 |
| Washington | 330 |
| Denver | 311 |
| Atlanta | 307 |
| Minneapolis | 306 |
| Philadelphia | 261 |
| Houston | 254 |
| Washington, DC | 246 |
| Dallas | 221 |
| NYC | 212 |
| Vancouver | 203 |
us_state_counts <- salary_data_tr4 %>%
group_by(us_state) %>%
count() %>%
filter(n > 200) %>%
arrange(desc(n))
kable(us_state_counts, caption = "Distribution of factor: us_state (n > 200 only)")
| us_state | n |
|---|---|
|  | 4206 |
| California | 2522 |
| New York | 2104 |
| Massachusetts | 1478 |
| Texas | 1216 |
| Illinois | 1169 |
| Washington | 1143 |
| District of Columbia | 954 |
| Pennsylvania | 902 |
| Virginia | 752 |
| Minnesota | 709 |
| Ohio | 628 |
| Colorado | 615 |
| Oregon | 607 |
| North Carolina | 577 |
| Maryland | 547 |
| Michigan | 521 |
| Georgia | 510 |
| Florida | 494 |
| Wisconsin | 455 |
| New Jersey | 385 |
| Missouri | 324 |
| Indiana | 308 |
| Arizona | 294 |
| Tennessee | 273 |
| Connecticut | 223 |
race_main_counts <- salary_data_tr4 %>%
group_by(race_main) %>%
count() %>%
arrange(desc(n))
kable(race_main_counts, caption = "Distribution of factor: race_main")
| race_main | n |
|---|---|
| White | 21877 |
| Asian or Asian American | 1307 |
| Another option not listed here or prefer not to answer | 711 |
| Black or African American | 630 |
| Hispanic, Latino, or Spanish origin | 553 |
| Hispanic, Latino, or Spanish origin, White | 362 |
| Asian or Asian American, White | 331 |
| Other | 158 |
| Black or African American, White | 113 |
| Middle Eastern or Northern African, White | 73 |
| Native American or Alaska Native, White | 65 |
| Middle Eastern or Northern African | 64 |
| White, Another option not listed here or prefer not to answer | 59 |
| Native American or Alaska Native | 39 |
Education, work experience and age appear to be somewhat normally distributed.
The majority of respondents have a college degree as their highest form of education. A Master’s degree is also fairly popular, while the remainder of options are all much less frequent.
For overall work experience, note that the band widths of the categories are not equal, which limits fair comparisons. Work experience of 11 - 20 years is the most frequent option, with 8 - 10 years of experience coming in second despite its much narrower band. Frequency progressively decreases towards the tails. As expected, very few respondents have worked for 41 or more years.
Work experience in one’s particular field is more evenly distributed over the categories ranging from 2 years to 20 years. Again, frequency decreases towards the tails.
When observing the age group graph, we again notice some inconsistencies between the bandwidths of categories. The 25 - 34 category is the most frequent, with 35 - 44 a bit less so and then a fairly large decrease to the next greatest, 55 - 64.
Interestingly, the gender distribution is heavily weighted towards women, with approximately four times more respondents than men.
USD is the most common currency by a significant margin. The Canadian dollar, British pound, Euro and Australian/NZ dollar are the next most common, with the remaining currencies being very infrequent.
Region is almost completely dominated by Northern America, with the next most common being Northern Europe.
In industry, computing or technology is considerably more frequent than the rest. White is by far the most frequent race response.
We are interested in exploring whether there is any association between a person’s global region and annual salary. Initially, we perform an exploratory analysis of the data to look for patterns and trends. Following this, we perform inferential tests to gain more statistical insight.
Salary by region statistics (ordered by median salary)
#Alternative method:
#library(psych)
#region_summaries <- describeBy(salary_USD ~ region, data = salary_data_tr4, mat = TRUE, digits = 2)
#create a summary statistics table for region factor
region_summaries <- salary_data_tr4 %>%
group_by(region) %>%
summarize(count = n(),
average = round(mean(salary_USD)),
std_err = round(sd(salary_USD) / sqrt(count)),
median = round(median(salary_USD)),
std_err_median = round(((quantile(salary_USD, 0.75) - quantile(salary_USD, 0.25)) / sqrt(n())) * 1.253),
var = round(var(salary_USD)),
std_dev = round(sd(salary_USD))
) %>%
arrange(desc(median))
#Produce a nicely formatted table
kable(region_summaries)
| region | count | average | std_err | median | std_err_median | var | std_dev |
|---|---|---|---|---|---|---|---|
| Western Asia | 13 | 101430 | 15365 | 95000 | 10773 | 3068961377 | 55398 |
| Eastern Europe | 8 | 97225 | 16153 | 79317 | 25041 | 2087459753 | 45689 |
| Western Europe | 375 | 86636 | 1907 | 78512 | 2707 | 1363248486 | 36922 |
| Central America, South America & Caribbean | 21 | 75186 | 7244 | 78000 | 9843 | 1102010286 | 33197 |
| Northern America | 24278 | 88496 | 279 | 77000 | 426 | 1895642514 | 43539 |
| NA | 71 | 81929 | 4660 | 75000 | 8108 | 1541693186 | 39264 |
| Southern Asia | 7 | 83538 | 15653 | 70000 | 20337 | 1715015103 | 41413 |
| Northern Europe | 1328 | 77730 | 993 | 66615 | 1307 | 1310439000 | 36200 |
| Australia and New Zealand | 127 | 72741 | 3064 | 65007 | 4021 | 1192658455 | 34535 |
| Africa | 27 | 67507 | 6917 | 65000 | 10784 | 1291894626 | 35943 |
| Southern Europe | 41 | 77216 | 7043 | 63032 | 8041 | 2033738776 | 45097 |
| South-Eastern Asia | 12 | 85667 | 17201 | 53750 | 22110 | 3550560606 | 59587 |
| Eastern Asia | 34 | 67964 | 7688 | 49040 | 8413 | 2009455837 | 44827 |
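The std_err_median column uses the rough large-sample rule that the standard error of a median is about 1.253 times that of the mean, with the IQR standing in for the spread. As a standalone helper, this could be written as follows (an illustrative sketch, not part of the original chunk):
#Approximate standard error of a median: 1.253 * IQR / sqrt(n) (rough large-sample proxy)
se_median <- function(x) {
1.253 * (quantile(x, 0.75) - quantile(x, 0.25)) / sqrt(length(x))
}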
Median salary per region compared to global median.
global_median = median(salary_data_tr4$salary_USD)
salary_data_tr4 %>%
group_by(region) %>%
summarise(med = median(salary_USD)) %>%
filter(region != 'NA') %>%
ggplot(aes(x = med, y = reorder(region, med), color = med)) +
geom_point(size = 4) +
geom_segment(aes(xend = 40000, yend = region), linewidth = 2) +
geom_text(aes(label = round(med / 1000)), color = "white", size = 2.5) +
scale_x_continuous("", expand = c(0,0), limits = c(40000,100000), position = "top", labels = scales::comma_format(scale = 1e-3)) +
scale_color_gradient(low = "#F08080", high = "#007bc2", guide = "legend") +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(margin = margin(t = 10, b = 10)),
plot.margin = margin(l = 10, r = 20)
) +
labs(title = bquote(bold("Median Salary ") ~ "(in US$1000) per " ~ bold("Region"))) +
geom_vline(xintercept = global_median, color = "grey40", linetype = 3) +
annotate("text", x = global_median + 5000, y = 5.5, label = "The\nglobal\nmedian",
vjust = 1, size = 3, color = "grey40") +
annotate("curve", x = global_median + 5000, y = 5.6, xend = global_median + 1000, yend = 7,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"), color = "grey40")
Median salary for each region with associated standard error bars and count values, ranked in descending order of median value
#filter out NA values first
salary_data_tr4_noNA <- salary_data_tr4 %>%
filter(!is.na(region))
# Calculate the median salary and associated SE estimates for each region
median_salaries <- salary_data_tr4_noNA %>%
group_by(region) %>%
summarise(median_salary = median(salary_USD),
se_med_salary = ((quantile(salary_USD, 0.75) - quantile(salary_USD, 0.25)) / sqrt(n())) * 1.253,
count = n())
#plot bar graphs
ggplot(median_salaries, aes(x = reorder(region, -median_salary), y = median_salary)) +
geom_col(fill = "#007bc2") +
geom_errorbar(aes(ymin = median_salary - se_med_salary, ymax = median_salary + se_med_salary), width = 0.1) +
geom_label(aes(x = region, y = 7000, label = count), size = 2.5) +
theme_classic() +
labs(x = "Region", y = "Median Salary (USD)", title = bquote(bold("Median Salary ") ~ "per " ~ bold("Region") ~ " (with SE bars)")) +
theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10), plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5))
Side by side boxplots of salary by region (ordered by median salary)
ggplot(salary_data_tr4_noNA, aes(y = salary_USD, x = reorder(region, -salary_USD, FUN = median))) +
geom_boxplot(fill = "blue", color = "black", width = 0.8, alpha = 0.2) +
labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Region", y = "Salary (USD)") +
theme_classic() +
theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10), plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5, size = 16))
Side by side dot plots of salary by region (ordered by median salary)
ggplot(salary_data_tr4_noNA, aes(x = salary_USD, y = reorder(region, salary_USD, FUN = median))) +
geom_point(alpha = 0.3, fill = "blue", colour = "blue") +
labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Salary (USD)", y = "Region") +
theme(plot.title = element_text(margin = margin(t = 10, b = 10), size = 16))
Side-by-side violin plots of salary by region
ggplot(data = salary_data_tr4_noNA, aes(x = reorder(region, salary_USD, FUN = median), y = salary_USD, fill = region)) +
geom_violin(alpha = .25, color = "grey") +
#optional points superimposed on plots
#geom_jitter(alpha=.10, aes(color=region), position=position_jitter(width=0.3)) +
#geom_label(aes(x = 0, y = region, label = count), size = 2.5) +
labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Region", y = "Salary (USD)") +
#scale_fill_manual(values=cbPalette) +
scale_fill_brewer(palette = "Paired") +
coord_flip() +
theme_classic() +
theme(plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5, size = 16), legend.position = "none")
Overlapping density plots for salary by region.
ggplot(salary_data_tr4_noNA, aes(x = salary_USD, fill = region, color = region)) +
geom_density(alpha = 0.3, linewidth = 0.1) +
labs(title = bquote(bold("Salary") ~ " by" ~ bold("Region")), x = "Salary (USD)", y = "Density") +
theme_classic() +
theme(legend.title = element_blank(), legend.position = "top", legend.text = element_text(size = 8),
legend.key.width = unit(0.5, "cm"), legend.key.height = unit(0.5, "cm"), legend.background = element_rect(fill = "grey90", color = "grey"), plot.title = element_text(hjust = 0.5, size = 16))
Comparison density plots of salary by region.
ggplot(salary_data_tr4_noNA, aes(x = salary_USD, fill = region, color = region)) +
geom_density(alpha = 0.4, linewidth = 0.1) +
geom_vline(data = median_salaries, aes(xintercept = median_salary), color = "red", linetype = "dashed") +
facet_wrap(~ region, ncol = 3) + # Separate plots for each level of 'region'
labs(title = bquote(bold("Salary ") ~ "by" ~ bold("Region") ~ "(with median indicators)"), x = "Salary (USD)", y = "Density") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.3, size = 14))Side by side dot plots of salary by region for top earners only (> $400,000):
salary_data2 %>%
filter(salary_USD > 400000) %>%
ggplot(aes(x = salary_USD, y = reorder(region, salary_USD, FUN = max))) +
geom_point(alpha = 0.3, fill = "blue", colour = "blue") +
labs(title = bquote(bold("Salary ") ~ "(USD) by " ~ bold("Region") ~ " (top earners only)"), x = "Salary (USD)", y = "Region") +
theme(plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5))
Initial ANOVA
Investigate the significance of differences in mean salary values between regions with an initial ANOVA test.
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
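The chunk that fits this model is not echoed in the report; presumably anova_region was created along the following lines (a sketch, assuming the untransformed salary_USD from salary_data_tr4):
#Sketch of the presumed model fit behind the output below
anova_region <- aov(salary_USD ~ region, data = salary_data_tr4)
summary(anova_region)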
## Df Sum Sq Mean Sq F value Pr(>F)
## region 11 2.114e+11 1.922e+10 10.36 <2e-16 ***
## Residuals 26259 4.872e+13 1.856e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Effect Size Calculation:
# Extracting sums of squares
SS_Between <- anova(anova_region)$Sum[1]
SS_Within <- anova(anova_region)$Sum[2]
SS_Total <- SS_Between + SS_Within
# Calculating effect size measures
eta_squared <- SS_Between / SS_Total
# Displaying effect size measures
cat("Eta-squared:", eta_squared)## Eta-squared: 0.004321007
The result is statistically significant (p-value < 0.000001 (6dp)). However, the exploratory analysis suggested violations of the ANOVA assumptions of normality and homoscedasticity, so further exploration is required.
Diagnostics
Perform some in-depth diagnostics to check whether the assumptions for ANOVA are met.
A boxplot of the residuals
ggplot() +
geom_boxplot(aes(y = anova_region$res), fill = "skyblue") +
labs(title = "Boxplot of inital ANOVA model residuals", y = "salary (USD)") +
theme_minimal()
The Q-Q residuals plot suggests a great deal of positive skew and therefore violation of normality. The box plot of residuals reinforces this trend.
The Residuals vs Fitted plot affirms this, with the positive residuals showing more spread than the negative residuals. The plot also indicates some differences in the variance of residuals across groups. In particular, Eastern Europe, Southern Asia and Central America, South America & Caribbean appear to have noticeably less variance than Northern America, Northern Europe and Western Europe (note there is likely a sample size effect present, as the latter groups have the greatest numbers of observations in the dataset, while the former groups have the least).
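The Q-Q Residuals and Residuals vs Fitted plots referred to here are the standard base R diagnostics for a fitted model; their chunk is not echoed, but they could be reproduced along these lines (a sketch):
#Standard diagnostic plots for the fitted model (includes Residuals vs Fitted and Q-Q Residuals)
par(mfrow = c(2, 2))
plot(anova_region)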
In fact, it is very likely that the statistical power of the ANOVA is being heavily driven by the weighting of the largest group, Northern America. In the exploratory graph below, the global mean is almost identical to the Northern America mean, reinforcing this region’s overall influence as it dominates the sample.
It may be necessary to remove the regions with very small sample sizes, as this imbalance in sample sizes undermines the reliability of our results.
global_mean = mean(salary_data_tr4_noNA$salary_USD)
salary_data_tr4 %>%
group_by(region) %>%
summarise(avg = mean(salary_USD)) %>%
filter(region != 'NA') %>%
ggplot(aes(x = avg, y = reorder(region, avg), color = avg)) +
geom_point(size = 4) +
geom_segment(aes(xend = 50000, yend = region), linewidth = 2) +
geom_text(aes(label = round(avg / 1000)), color = "white", size = 2.5) +
scale_x_continuous("", expand = c(0,0), limits = c(50000,105000), position = "top") +
scale_color_gradient(low = "#F08080", high = "skyblue", guide = "legend") +
theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none",
plot.title = element_text(margin = margin(t = 10, b = 10)),
plot.margin = margin(l = 10, r = 20)
) +
labs(title = bquote(bold("Average Salary ") ~ " (in US$1000) per " ~ bold("Region"))) +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
annotate("text", x = global_mean + 5000, y = 5.5, label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40") +
annotate("curve", x = global_mean + 5000, y = 5.6, xend = global_mean + 1000, yend = 7,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"), color = "grey40")
Normality Tests
Our sample size is too large to perform some of the popular normality tests, e.g. Shapiro-Wilk. We can attempt some others:
#Kolmogorov-Smirnov Test
ks.test(salary_data_tr4_noNA$salary_USD, "pnorm", mean = mean(salary_data_tr4_noNA$salary_USD), sd = sd(salary_data_tr4_noNA$salary_USD))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: salary_data_tr4_noNA$salary_USD
## D = 0.1167, p-value < 2.2e-16
## alternative hypothesis: two-sided
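The Jarque-Bera output below is produced by a chunk that is not echoed; one way to obtain it (assuming the tseries package was used) is:
#Jarque-Bera test of normality (sketch; assumes the tseries implementation)
library(tseries)
jarque.bera.test(salary_data_tr4_noNA$salary_USD)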
##
## Jarque Bera Test
##
## data: salary_data_tr4_noNA$salary_USD
## X-squared = 12962, df = 2, p-value < 2.2e-16
Both the Kolmogorov-Smirnov test (D = 0.1167, p-value < 2.2e-16) and the Jarque-Bera test (X-squared = 12962, p-value < 2.2e-16) results are statistically significant, indicating we reject the null hypothesis that the variable is normally distributed. Keep in mind that, with a very large sample size (n = 26271), we generally have increased statistical power, meaning we can detect very small deviations from normality, even if those deviations are of little practical importance.
Equal Variances Tests
Rule of thumb check for equal variances by calculating the ratio between the largest variance (South-Eastern Asia: 3,550,560,606) and the smallest variance (Central America, South America & Caribbean: 1,102,010,286):
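One way to compute this ratio from the region_summaries table built earlier (a sketch; the report's own chunk is not shown):
#Ratio of largest to smallest group variance (rule-of-thumb threshold ~4)
max(region_summaries$var) / min(region_summaries$var)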
## [1] 3.221894
The resulting value is slightly under the rule of thumb threshold of 4, indicating variances may be similar enough to satisfy the assumptions of ANOVA.
We can conduct some more general tests for homoscedasticity:
Hartley’s Test (F-test ratio check of variance).
index_1 <- which(salary_data_tr4$region == "South-Eastern Asia")
index_2 <- which(salary_data_tr4$region == "Central America, South America & Caribbean")
var.test(salary_data_tr4$salary_USD[index_1], salary_data_tr4$salary_USD[index_2])
##
## F test to compare two variances
##
## data: salary_data_tr4$salary_USD[index_1] and salary_data_tr4$salary_USD[index_2]
## F = 3.2219, num df = 11, denom df = 20, p-value = 0.02237
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.184145 10.394297
## sample estimates:
## ratio of variances
## 3.221894
Bartlett’s Test
##
## Bartlett test of homogeneity of variances
##
## data: salary_USD by region
## Bartlett's K-squared = 110.8, df = 11, p-value < 2.2e-16
Levene Test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 11 8.6269 1.952e-15 ***
## 26259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
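Neither of these chunks is echoed above; the calls were presumably along the following lines (a sketch; leveneTest is assumed to come from the car package, which uses median centring by default):
#Bartlett's test (base R) and Levene's test (car package), illustrative calls only
bartlett.test(salary_USD ~ region, data = salary_data_tr4_noNA)
library(car)
leveneTest(salary_USD ~ region, data = salary_data_tr4_noNA)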
Hartley’s test returns a statistically significant result (F = 3.222 (3dp), num df = 11, denom df = 20, p-value = 0.02237). At the 5% level, we can reject the null hypothesis that the true ratio of variances is equal to 1. Note, there is a very large 95% CI for the F-stat ratio, as we have relatively small sample sizes for the groups concerned in the test. The CI almost contains a ratio value of 1.
Bartlett’s test may be more appropriate as we have more than two groups. We get a statistically significant result (K-squared = 110.8, df = 11, p-value < 2.2e-16), suggesting to reject the assumption of equal variances. However, this test is very sensitive to the normality assumption being satisfied.
Levene’s test (here run with median centring, i.e. the Brown-Forsythe variant) is less sensitive to the normality assumption than Bartlett’s and Hartley’s tests, making it perhaps the more suitable choice. Again, the result is highly statistically significant (F = 8.6269, df = 11, p-value = 1.952e-15), suggesting unequal variances.
Conclusion
We have enough evidence to be concerned that both the assumptions of normality and homoscedasticity are violated.
Although, given the very large sample size, the degree of violation of
normality could arguably be overlooked, we established earlier that a log
transformation of salary_USD will significantly improve
normality and homoscedasticity, and it is worth proceeding with.
Additionally, to improve the overall equality of sample sizes, and increase statistical power, we will remove some of the smaller region groups.
Sensitivity Analysis
We will perform a sensitivity analysis by initially applying a standard ANOVA model to the transformed dataset, and then iterating through various alternative models as required.
First, transform and mutate the data.
#Update trimmed dataframe with the log-transformed salary_USD variable and rename
salary_data_tr4 <- cbind(salary_data_tr4, salary_USD_tr4_log)
names(salary_data_tr4)[ncol(salary_data_tr4)] = "salary_USD_log"
#extract vector of names of regions to remove
names_reg <- salary_data_tr4 %>%
group_by(region) %>%
count() %>%
filter(n < 30) %>%
pull(region) %>%
as.character()
#remove small region groups and remove NA group
salary_data_tr4_2 <- salary_data_tr4 %>%
filter(!is.na(region) & !region %in% names_reg)
Run the ANOVA model on transformed data (and a corresponding linear model for more insight):
Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.
#ANOVA test output
anova_region2 <- aov(salary_USD_log ~ region, data = salary_data_tr4_2)
summary(anova_region2)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 5 22 4.374 21.28 <2e-16 ***
## Residuals 26177 5382 0.206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#corresponding linear model output
lm_region2 <- lm(salary_USD_log ~ region, data = salary_data_tr4_2)
summary(lm_region2)
##
## Call:
## lm(formula = salary_USD_log ~ region, data = salary_data_tr4_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23510 -0.33249 -0.03174 0.31581 1.35648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.10585 0.04023 276.031 < 2e-16 ***
## regionEastern Asia -0.13629 0.08755 -1.557 0.1196
## regionNorthern America 0.17745 0.04034 4.399 1.09e-05 ***
## regionNorthern Europe 0.07216 0.04211 1.713 0.0866 .
## regionSouthern Europe 0.02347 0.08144 0.288 0.7732
## regionWestern Europe 0.18510 0.04655 3.976 7.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4534 on 26177 degrees of freedom
## Multiple R-squared: 0.004048, Adjusted R-squared: 0.003857
## F-statistic: 21.28 on 5 and 26177 DF, p-value: < 2.2e-16
Effect Size Calculation:
# Extracting sums of squares
SS_Between <- anova(anova_region2)$Sum[1]
SS_Within <- anova(anova_region2)$Sum[2]
SS_Total <- SS_Between + SS_Within
# Calculating effect size measures
eta_squared <- SS_Between / SS_Total
# Displaying effect size measures
cat("Eta-squared:", eta_squared)## Eta-squared: 0.0040477
Alternatives to standard ANOVA
We can attempt some other alternatives as a point of comparison.
Unequal Variances (Welch’s) ANOVA:
Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.
anova_region_welch <- oneway.test(salary_USD_log ~ region, data = salary_data_tr4_2)
anova_region_welch
##
## One-way analysis of means (not assuming equal variances)
##
## data: salary_USD_log and region
## F = 25.36, num df = 5.00, denom df = 176.11, p-value < 2.2e-16
Non-parametric Kruskal-Wallis test:
Null Hypothesis: There is no significant difference in median logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in median logarithm of salaries across at least one pair of regions.
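The test itself is run and annotated on the plot by ggbetweenstats (from the ggstatsplot package) below; as a plain cross-check, the same Kruskal-Wallis test could be run directly in base R:
#Base-R Kruskal-Wallis test (cross-check of the statistic reported on the plot below)
kruskal.test(salary_USD_log ~ region, data = salary_data_tr4_2)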
ggbetweenstats(data = salary_data_tr4_2, x = region, y = salary_USD_log,
type = "nonparametric", outlier.tagging = TRUE) +
labs(y = "log(salary (USD))", title = "Distributions of log(salary (USD)) by region with Kruskal-Wallis and Pairwise tests")
Non-parametric Random Permutation Test (using un-transformed data):
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
#First assign the F-value from the previous regular anova model to a new object
anova_f <- summary(anova_region2)[[1]][1,4]
#establish the count for each region group (without NAs)
region_counts <- salary_data_tr4_2 %>%
group_by(region) %>%
summarise(n = n())
#assign a permutation function (data = data frame, R = number of resamples); group sample sizes are taken from the region_counts table defined above
perm_fun <- function(data, R = 1000){
#assign initial vectors and parameters
mean_overall <- mean(data[,'salary_USD'])
f_vec = numeric(R)
n_vec = as.numeric(region_counts$n)
n <- sum(n_vec)
# Create names using paste0 and letters then assign names to the vector
names_vec <- paste0("n", letters[seq_along(n_vec)])
names(n_vec) <- names_vec
#loop for each re-sample iteration
for(r in 1:R){
#assign new sum of squares within and between vectors
ss_w = numeric(length(n_vec))
ss_b = numeric(length(n_vec))
#loop for each sample subsample of region
for(k in 1:length(n_vec)){
#draw random values from collated dataset
index <- sample(1:n, n_vec[k])
values <- data[index,'salary_USD']
#calc. the mean and sum of squares for that random sample
mean_group <- mean(values)
ss_w[k] <- sum((values - mean_group)^2)
ss_b[k] <- (mean_group - mean_overall)^2 * n_vec[k]
}
#calc. the mean ss and then f statistic for that complete permutation
mss_w <- sum(ss_w) / (n - 1)
mss_b <- sum(ss_b) / (length(n_vec) - 1)
f_stat <- mss_b / mss_w
#update the f_vec with this permutation's f_stat
f_vec[r] = f_stat
}
#plot and print the results including p-value, 0.95 quantile (our threshold for significance) and the max value
print(paste("0.95 quantile: ", round(quantile(f_vec, 0.95), 3)))
print(paste("maximum F-value: ", round(max(f_vec), 3)))
print(paste("observed F-value: ", anova_f))
p_value <- mean(f_vec > anova_f)
print(paste("p-value: ", p_value))
#plot a histogram of the sampling distribution with lines marking the 0.95 quantile (0.05 significance level) and the observed F-value
hist(f_vec, main = "Empirical sampling distribution of F statistic\nfrom Random Permutation Test", xlab = "F value", breaks = 50, xlim = c(0,anova_f +5), prob = T)
abline(v = anova_f, col = "red", lty = 2)
text(anova_f, par("usr")[4] - 0.10, "Observed\nF-value", pos = 4, col = "red")
abline(v = quantile(f_vec, 0.95), col = "blue", lty = 2)
text(quantile(f_vec, 0.95), par("usr")[4] - 0.05, "0.95 quantile", pos = 4, col = "blue")
}
#Run function
perm_fun(data = salary_data_tr4_2, R = 1000)
## [1] "0.95 quantile: 2.127"
## [1] "maximum F-value: 4.956"
## [1] "observed F-value: 21.2774541989364"
## [1] "p-value: 0"
A benefit of the random permutation test is that we do not need to make assumptions about the sampling distribution of our test (F) statistic in order to determine the probability of our observed group differences.
The plotted distribution essentially displays the distribution of the F statistic under the null hypothesis that there is no meaningful difference in salary between regions. The blue line represents the 0.95 quantile of the null distribution, which we use as our significance threshold, and the red line represents the F value obtained from our sample. The observed F-value is far greater than the 0.95 quantile, and the probability of achieving such a value under the null hypothesis is extremely small (p-value = 0).
Non-Parametric Bootstrap Test (using un-transformed data):
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
#first define the group counts and means
region_stats <- salary_data_tr4_2 %>%
group_by(region) %>%
summarise(n = n(), mean = mean(salary_USD))
#Create bootstrap function
boot_fun <- function(data, R = 1000){
#'shift' data for each group so that it is 'centered' over 0, by subtracting the mean of each group from its values
bootstrap_anova_data <- data %>%
group_by(region) %>%
mutate(mean_salary = mean(salary_USD),
mean_diff = salary_USD - mean_salary) %>%
ungroup() %>%
select(mean_diff, region)
bootstrap_anova_data <- as.data.frame(bootstrap_anova_data)
#Set the initial bootstrap conditions and objects
f_vec = numeric(R)
# Create a new vector 'names' by repeating each region 'n' times
group_names <- rep(region_stats$region, times = region_stats$n)
#Run the bootstrap sampling
for(i in 1:R){
#randomly sample with replacement from each group
groupA = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Northern America", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Northern America", "n"]), replace = TRUE)
groupB = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Australia and New Zealand", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Australia and New Zealand", "n"]), replace = TRUE)
groupC = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Eastern Asia", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Eastern Asia", "n"]), replace = TRUE)
groupD = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Northern Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Northern Europe", "n"]), replace = TRUE)
groupE = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Southern Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Southern Europe", "n"]), replace = TRUE)
groupF = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Western Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Western Europe", "n"]), replace = TRUE)
#collate the data
samples_collated = c(groupA, groupB, groupC, groupD, groupE, groupF)
collated_data = data.frame(samples_collated, group_names)
f_vec[i] = oneway.test(samples_collated ~ group_names, data = collated_data, var.equal = T)$statistic
}
#plot and print the results including p-value, 0.95 quantile (our threshold for significance) and the max value
print(paste("0.95 quantile: ", round(quantile(f_vec, 0.95), 3)))
print(paste("maximum F-value: ", round(max(f_vec), 3)))
print(paste("observed F-value: ", anova_f))
p_value <- mean(f_vec > anova_f)
print(paste("p-value: ", p_value))
#create a histogram of the bootstrapped sampling distribution with our F-value obtained from the original ANOVA and the 0.95 quantile
hist(f_vec, main = "Empirical sampling distribution of F-statistic\nfrom bootstrap ANOVA", xlab = "F value", breaks = 50, xlim = c(0, anova_f + 5), prob = T)
abline(v = anova_f, col = "red", lty = 2)
text(anova_f, par("usr")[4] - 0.10, "Observed\nF-value", pos = 4, col = "red")
abline(v = quantile(f_vec, 0.95), col = "blue", lty = 2)
text(quantile(f_vec, 0.95), par("usr")[4] - 0.05, "0.95 quantile", pos = 4, col = "blue")
}
#run the function
boot_fun(data = salary_data_tr4_2, R = 1000)
## [1] "0.95 quantile: 2.013"
## [1] "maximum F-value: 3.758"
## [1] "observed F-value: 21.2774541989364"
## [1] "p-value: 0"
Like the random permutation test, the bootstrap test does not need to make assumptions about the sampling distribution of the test (F) statistic in order to determine the probability of our observed group differences.
This test operates by creating a bootstrapped sampling distribution of the F-statistic, where the data are first centred by subtracting the group mean from each value in that group. This sampling distribution represents what we might observe if there were no true differences between groups.
The observed F-value is highly significant again (p-value = 0).
Diagnostics
We can assess the degree of normality and homoscedasticity for the logarithm transformed salary variable.
Normality Tests
#Kolmogorov-Smirnov Test
ks.test(salary_data_tr4_2$salary_USD_log, "pnorm", mean = mean(salary_data_tr4_2$salary_USD_log), sd = sd(salary_data_tr4_2$salary_USD_log))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: salary_data_tr4_2$salary_USD_log
## D = 0.038035, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
## Jarque Bera Test
##
## data: salary_data_tr4_2$salary_USD_log
## X-squared = 501.01, df = 2, p-value < 2.2e-16
Equal Variances Tests
#rule of thumb test
var(salary_data_tr4_2$salary_USD_log[which(salary_data_tr4_2$region == "Eastern Asia")]) / var(salary_data_tr4_2$salary_USD_log[which(salary_data_tr4_2$region == "Northern Europe")])
## [1] 1.916818
#Hartley's Test (F-test ratio check of variance)
index_1 <- which(salary_data_tr4_2$region == "Eastern Asia")
index_2 <- which(salary_data_tr4_2$region == "Northern Europe")
var.test(salary_data_tr4_2$salary_USD_log[index_1], salary_data_tr4_2$salary_USD_log[index_2])
##
## F test to compare two variances
##
## data: salary_data_tr4_2$salary_USD_log[index_1] and salary_data_tr4_2$salary_USD_log[index_2]
## F = 1.9168, num df = 33, denom df = 1327, p-value = 0.002876
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.237821 3.335939
## sample estimates:
## ratio of variances
## 1.916818
##
## Bartlett test of homogeneity of variances
##
## data: salary_USD_log by region
## Bartlett's K-squared = 84.544, df = 5, p-value < 2.2e-16
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 19.527 < 2.2e-16 ***
## 26177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Diagnostic Plots
The Residuals vs Fitted plot suggests notable improvement in normality and homoscedasticity. The Q-Q Residuals plot indicates some improvement in normality, although some discrepancy remains: instead of the positive skew present before the transformation, there now appears to be some degree of heavy tails.
Some of the tests for normality and equal variances also suggest improvement, while others indicate the same degree of violation of these assumptions.
The density plots below illustrate somewhat improved normality across regions.
# Calculate the median salary for each region
median_salaries <- salary_data_tr4_2 %>%
group_by(region) %>%
summarise(median_salary = median(salary_USD_log))
#plot
ggplot(salary_data_tr4_2, aes(x = salary_USD_log, fill = region, color = region)) +
geom_density(alpha = 0.4, linewidth = 0.1) +
geom_vline(data = median_salaries, aes(xintercept = median_salary), color = "red", linetype = "dashed") +
facet_wrap(~ region, ncol = 3) + # Separate plots for each level of 'region'
labs(title = bquote(bold("Salary ") ~ "by" ~ bold("Region") ~ "(with median indicators)"), x = "log(Salary (USD))", y = "Density") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5, size = 16))
Post Hoc Individual Group Comparisons
Pairwise comparisons are made to identify significantly different groups, using the non-parametric Dunn’s test with Holm corrections applied to the p-values.
Output displays statistically significant results only.
#perform Dunn's Test with Holm correction for p-values
post_hoc_table <- dunnTest(salary_USD ~ region, data = salary_data_tr4_2, method = "holm")
#create a refined table with only the significant comparisons and required elements displayed
post_hoc_table_refined <- as.data.frame(post_hoc_table[[2]]) %>%
select(Comparison, P.adj) %>%
filter(P.adj < 0.05) %>%
rename("P-value" = P.adj) %>%
arrange(desc(`P-value`)) %>%
mutate(
`Effect Size` = NA_real_,
`Median Difference (US$)` = NA_real_
)
#define vectors of salary for specific countries
n_a = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Northern America"]
a_nz = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Australia and New Zealand"]
e_a = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Eastern Asia"]
n_e = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Northern Europe"]
w_e = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Western Europe"]
#Calculate the Probability of Superiority (ps) for each significantly different group comparison and append to current table
post_hoc_table_refined[1,3] <- round(sum(outer(e_a, w_e, ">")) / (length(e_a) * length(w_e)),2)
post_hoc_table_refined[2,3] <- round(sum(outer(e_a, n_a, ">")) / (length(e_a) * length(n_a)),2)
post_hoc_table_refined[3,3] <- round(sum(outer(a_nz, w_e, ">")) / (length(a_nz) * length(w_e)),2)
post_hoc_table_refined[4,3] <- round(sum(outer(a_nz, n_a, ">")) / (length(a_nz) * length(n_a)),2)
post_hoc_table_refined[5,3] <- round(sum(outer(n_e, w_e, ">")) / (length(n_e) * length(w_e)),2)
post_hoc_table_refined[6,3] <- round(sum(outer(n_a, n_e, ">")) / (length(n_a) * length(n_e)),2)
#Calculate the absolute differences in medians of salary_USD between groups
post_hoc_table_refined[1,4] <- median(e_a) - median(w_e)
post_hoc_table_refined[2,4] <- median(e_a) - median(n_a)
post_hoc_table_refined[3,4] <- median(a_nz) - median(w_e)
post_hoc_table_refined[4,4] <- median(a_nz) - median(n_a)
post_hoc_table_refined[5,4] <- median(n_e) - median(w_e)
post_hoc_table_refined[6,4] <- median(n_a) - median(n_e)
#output formatted table
kable(post_hoc_table_refined)
| Comparison | P-value | Effect Size | Median Difference (US$) |
|---|---|---|---|
| Eastern Asia - Western Europe | 0.0010674 | 0.29 | -29472.5 |
| Eastern Asia - Northern America | 0.0010165 | 0.31 | -27960.5 |
| Australia and New Zealand - Western Europe | 0.0002662 | 0.36 | -13505.0 |
| Australia and New Zealand - Northern America | 0.0000714 | 0.38 | -11993.0 |
| Northern Europe - Western Europe | 0.0000071 | 0.41 | -11897.0 |
| Northern America - Northern Europe | 0.0000000 | 0.58 | 10385.0 |
Effect sizes help determine the practical significance of the statistically significant differences. The table presents Probability of Superiority (PS) values to quantify this.
A PS value gives the probability that a randomly drawn observation from one group exceeds a randomly drawn observation from the other, providing a measure of the magnitude of the difference between groups. Values closer to 0 or 1 indicate larger effects, while values near 0.5 indicate little practical difference.
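The repeated outer() calculations above could equally be wrapped in a small helper function (an illustrative sketch, not part of the original chunk):
#Probability of Superiority: proportion of pairs where a value from x exceeds a value from y
prob_superiority <- function(x, y) {
round(sum(outer(x, y, ">")) / (length(x) * length(y)), 2)
}
#e.g. prob_superiority(e_a, w_e) reproduces the first Effect Size entry above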
Exploratory Analysis
Due to the significant positive skew within each salary distribution, we refer to the median as the measure of central tendency.
Western Asia and Eastern Europe exhibit the highest median salary values of $95,000 (SE = $10,773) and $79,317 (SE = $25,041), respectively. It’s worth noting that both groups have relatively small sample sizes (n = 13 and 8), which is reflected by their large SE values. Western Europe has the next highest median value ($78,512) with a larger sample size (n = 375) and smaller SE ($2,707).
Northern America, the most frequent region overall by a significant margin (n = 24,278, approximately 18x greater than the next most frequent), has the fifth greatest median salary, $77,000 (SE = $426). The bar graphs reveal that, given the uncertainty in the estimates for Eastern Europe and Central America, South America & Caribbean, it is possible that Western Europe and Northern America actually have higher median salaries than those regions. Southern Asia also has a relatively imprecise estimate of its population median (SE = $20,337).
Eastern Asia has the lowest median salary ($49,040, SE = $8,413) followed by South-Eastern Asia ($53,750, SE = $22,110) and Southern Europe ($63,032, SE = $8,041).
The side by side boxplots of salary by region reveal a number of positive outliers still remaining in the dataset. There is a noticeable trend in that the number of outliers for a region appears proportional to its overall sample size, suggesting all regions may share a similar distribution shape towards the tails. As a result, we see the most outliers for Northern America, then Northern Europe, then Western Europe. Western Asia and Central America, South America & Caribbean appear to have negative skew, contrasting with the rest of the distributions; this may be a result of their relatively small sample sizes (n = 13 and 21 respectively) not accurately representing their populations as a whole.
The side by side violin plots and density plots display the degree of positive skew present within each region. The density plots illustrate that there is a great deal of overlap between the distributions of each region. Those regions with larger sample sizes—most notably Northern America, Western Europe and Northern Europe—typically display more normal shaped distributions.
Inferential Testing / Sensitivity Analysis
We conducted an analysis of variance (ANOVA) to examine the differences in mean salaries across different global regions. The study aimed to investigate whether there are significant variations in salary levels among various geographic areas.
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
The ANOVA revealed a statistically significant difference in mean salaries among the regions, F(11, 26259) = 10.36, p < 0.000001 (6dp).
The effect size, as measured by eta-squared (η²), was calculated to be 0.004321 (6dp), indicating a small effect of region on salary variability.
We have enough evidence to be concerned that both the assumptions of normality and homoscedasticity are violated.
We performed a sensitivity analysis using alternative statistical methods. First we conducted an analysis of variance (ANOVA) with a log transformed response (and very small sample regions removed). The test investigated whether there are significant variations in logarithm transformed salary levels among various geographic areas.
Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.
The ANOVA revealed a statistically significant difference in mean logarithm of salaries among the regions, F(5, 26177) = 21.28, p < 0.000001 (6dp).
The effect size, as measured by eta-squared (η²), was calculated to be 0.004048 (6dp), indicating a small effect of region on logarithm of salary variability.
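A sketch of the log-transformed model, assuming salary_data_major excludes the very small regions and contains only strictly positive salaries:

# One-way ANOVA on log-transformed salaries
log_fit <- aov(log(salary_USD) ~ region, data = salary_data_major)
summary(log_fit)
effectsize::eta_squared(log_fit, partial = FALSE)

# Standard diagnostic plots to reassess normality and equal variance
par(mfrow = c(2, 2))
plot(log_fit)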
The exploratory graphs and diagnostic plots indicate improved satisfaction of the normality and equal-variance assumptions compared with our initial ANOVA. Despite this improvement, some diagnostic tests continue to indicate significant violations of normality and equal variances, possibly reflecting the large power these tests retain at such sample sizes. With very unbalanced sample sizes, any amount of heteroscedasticity is a concern.
Therefore we proceed with some alternative tests to take these considerations into account.
Next we proceed with an unequal-variances (Welch) ANOVA on our transformed data.
Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.
Welch’s ANOVA revealed a statistically significant difference in mean logarithm of salaries among the regions, F(5, 176.11) = 25.36, p < 0.000001 (6dp).
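Welch's ANOVA is available in base R via oneway.test(); a sketch under the same assumptions about the data frame:

# Welch's ANOVA: does not assume equal variances across regions
oneway.test(log(salary_USD) ~ region, data = salary_data_major, var.equal = FALSE)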
Welch’s ANOVA result may still not be reliable given the severely unbalanced sample sizes across regions. We proceed with some non-parametric alternatives.
A Kruskal-Wallis test is conducted to examine the differences in median logarithm of salaries across different global regions.
Null Hypothesis: There is no significant difference in median logarithm of salaries across different regions.
Alternative Hypothesis: There is a significant difference in median logarithm of salaries across at least one pair of regions.
The Kruskal-Wallis test revealed a statistically significant difference in median logarithm of salaries among the regions, χ2(5) = 126.64, p < 0.000001 (6dp).
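A sketch of the Kruskal-Wallis test in base R, under the same assumptions about the data frame:

# Rank-based test of whether at least one region's distribution is shifted
kruskal.test(log(salary_USD) ~ region, data = salary_data_major)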
A Random Permutation Test using the un-transformed data is conducted.
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
The random permutation test revealed a statistically significant difference in mean salaries among the regions, Fcrit (alpha = 0.05) = 2.17, F = 21.28, p = 0 (6dp).
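The report does not show the permutation code; the snippet below is one common way to build such a test, shuffling the salaries across regions and re-computing the ANOVA F statistic, again assuming salary_data_major.

set.seed(123)

# Observed F statistic from the ordinary one-way ANOVA
obs_F <- summary(aov(salary_USD ~ region, data = salary_data_major))[[1]][["F value"]][1]

# Re-compute F after randomly permuting the salaries across regions
perm_F <- replicate(1000, {
  shuffled <- sample(salary_data_major$salary_USD)
  summary(aov(shuffled ~ salary_data_major$region))[[1]][["F value"]][1]
})

quantile(perm_F, 0.95)   # critical F value at alpha = 0.05
mean(perm_F >= obs_F)    # permutation p-value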
A Bootstrap Test using the un-transformed data is conducted.
Null Hypothesis: There is no significant difference in mean salaries across different regions.
Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.
The bootstrap test revealed a statistically significant difference in mean salaries among the regions, Fcrit (alpha = 0.05) = 2.12, F = 21.28, p = 0 (6dp).
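Similarly, the bootstrap version can be sketched by resampling within each region after centring every group on the grand mean so that the null hypothesis holds; this is an assumed implementation rather than the exact code used for the result above.

set.seed(123)

obs_F <- summary(aov(salary_USD ~ region, data = salary_data_major))[[1]][["F value"]][1]

# Centre each group on the grand mean so the null of equal means is true
grand_mean <- mean(salary_data_major$salary_USD)
centred <- salary_data_major$salary_USD -
  ave(salary_data_major$salary_USD, salary_data_major$region) + grand_mean

# Resample within each region with replacement and re-compute the F statistic
boot_F <- replicate(1000, {
  resampled <- unsplit(
    lapply(split(centred, salary_data_major$region), sample, replace = TRUE),
    salary_data_major$region
  )
  summary(aov(resampled ~ salary_data_major$region))[[1]][["F value"]][1]
})

quantile(boot_F, 0.95)   # critical F value at alpha = 0.05
mean(boot_F >= obs_F)    # bootstrap p-value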
Individual Group Comparisons
For individual group comparisons, we can assess the results of the non-parametric Dunn’s pairwise tests, which are incorporated in our Kruskal-Wallis test output. These tests use the Holm correction technique to adjust for multiple comparisons. This controls the family-wise error rate and maintains the overall probability of making a Type I error (rejecting a true null hypothesis) across all pairwise comparisons at the desired alpha level of 0.05.
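A sketch of these pairwise comparisons using FSA::dunnTest (the FSA package is attached in the session info), assuming the same data frame:

library(FSA)

# Dunn's pairwise comparisons with Holm-adjusted p-values
dunn_res <- dunnTest(log(salary_USD) ~ region,
                     data = salary_data_major,
                     method = "holm")
dunn_res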
Refer to the table above, which displays output for the six significant differences in median salaries between regions. Note that we obtain the same six significant differences when using a parametric alternative, Tukey’s Honest Significant Differences.
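The parametric alternative can be obtained directly from a fitted ANOVA object with base R's TukeyHSD(); a sketch using the log-scale model assumed above:

# Tukey's Honest Significant Differences on the log-transformed ANOVA fit
TukeyHSD(aov(log(salary_USD) ~ region, data = salary_data_major))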
The table is ordered in descending order of practical and statistical significance. Probability of Superiority (PS) scores are used as effect sizes here. These represent the probability that a randomly selected value from one region will be greater than a randomly selected value from the compared region. Therefore the effect size is considered larger as the PS deviates further from 0.5.
We observe the greatest effect sizes between Eastern Asia and Western Europe and Eastern Asia and Northern America. It seems that Eastern Asia has a significantly smaller salary rate in comparison to those regions.
Northern America has significantly higher salary rates than Northern Europe, and Australia and New Zealand.
Note that the groups with considerably higher sample sizes are generally the groups with enough power to detect a true difference in median salary.
Overall, the findings from the sensitivity analysis agree with those obtained from the original ANOVA, providing robust evidence of the influence of global region on the response variable, salary. The consistency of results across different statistical methods enhances our confidence in the validity of the conclusions drawn from this study.
This analysis aimed to investigate factors associated with salary variability.
Specifically, there was interest in whether the following factors have any correlation with expected salaries:
- global region
- education level
- job experience
- age
- gender
Interpretation of Results - Region
Due to the nature of our data, in determining whether there is a significant difference in salary rates between any global regions, we can rely most heavily on our non-parametric test methods (random permutation test, bootstrap test, Kruskal-Wallis test).
Ultimately though, all of these methods, as well as our parametric test methods, provide evidence in support of the alternative hypothesis that there is a significant difference in mean/median salaries across at least one pair of regions.
Going deeper, we can analyse pairwise differences. Northern America and Western Europe appear to be the global regions with the largest reliable median salaries. Note that if we only assess point estimates, the Western Asia, Eastern Europe and Central America/South America/Caribbean regions actually have higher sample medians. However, given their much smaller sample sizes, we cannot be confident in these population estimates, and this is reflected in their relatively large 95% confidence intervals, which extend well below those of Northern America and Western Europe. Northern America and Western Europe are the only two regions with statistically greater median salaries than other regions.
All three of Eastern Asia, Australia/New Zealand and Northern Europe have significantly lower median salaries than the two front runners. The difference is also practically significant: Australia/New Zealand and Northern Europe have median salaries roughly US$10,000 to US$15,000 below those of Northern America and Western Europe. It is well known that the USA, Canada and some Western European countries such as the United Kingdom and Germany have strong economies; it is nonetheless interesting that these other ‘Western’ regions typically lag behind in expected salaries.
Eastern Asia comes in between $25,000 and $30,000 lower than the two leading regions, a highly significant result, especially considering that this region contains some strong economies, including Japan and South Korea. It is likely that exchange rates have a considerable effect when these comparisons are made in US dollars.
There are a number of other regions with median salaries lower than Northern America and Western Europe (and in fact Australia/NZ and Northern Europe) as well, including Southern Asia, Africa, Southern Europe, and South-Eastern Asia.
However, most of these regions have relatively small sample sizes and were therefore excluded from our analysis; statistical power was severely compromised for these groups. Based on the data we do have, I would be inclined to predict that, with a more balanced study design and therefore more consistent sample sizes, we would obtain significant differences for these regions as well.
Study Design and Improvements
Some interesting trends were observed and inferred through this analysis. There are, however, some constraints due to the nature of the data collection. It’s likely there is a degree of voluntary-response bias present in the study due to the self-selection nature of the survey.
One consequence of self-selection is very unequal sample sizes across most factor levels. For example, across the region factor, 92% of respondents are from the Northern America region, with the remaining 8% being spread over 11 separate global regions.
As a result, the accuracy of our sample estimates is somewhat unbalanced. Some regions contain very small relative sample sizes, for example South-Eastern Asia (n = 12). This means we have considerably reduced power in our statistical testing, and it becomes difficult to detect a true population difference in salary. The very small groups had to be excluded entirely from our inferential testing, and for the remaining groups we still needed to use non-parametric statistical techniques that are robust to the unbalanced sample sizes.
Another limitation imposed by the survey design is the potential for confounding variables. For example, respondents to the study might over-represent individuals with certain beliefs or traits, such as people who are generally more content with their salary overall, thus skewing our sample estimates.
Some examples of confounding variables include access to resources, the overall quality of one’s education, and general health and well-being.
Specific improvements to the study design could be considered in order to minimize the issues mentioned, most notably ensuring more balanced and adequate sample sizes across factor levels. Including questions that collect information from respondents about some of the confounding variables mentioned above could also help develop more accurate insights.
This section is included for the sake of reproducibility.
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Pacific/Auckland
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] FSA_0.9.5 ggstatsplot_0.12.2 car_3.1-2 carData_3.0-5
## [5] tseries_0.10-55 knitr_1.45 countrycode_1.5.0 lubridate_1.9.3
## [9] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [13] readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 Rmpfr_0.9-5 farver_2.1.1
## [4] statsExpressions_1.5.4 fastmap_1.1.1 bayestestR_0.13.2
## [7] digest_0.6.33 timechange_0.2.0 lifecycle_1.0.4
## [10] multcompView_0.1-10 magrittr_2.0.3 compiler_4.3.2
## [13] rlang_1.1.2 sass_0.4.8 tools_4.3.2
## [16] utf8_1.2.4 yaml_2.3.8 ggsignif_0.6.4
## [19] labeling_0.4.3 curl_5.2.0 TTR_0.24.4
## [22] RColorBrewer_1.1-3 SuppDists_1.1-9.7 abind_1.4-5
## [25] withr_3.0.0 grid_4.3.2 datawizard_0.10.0
## [28] fansi_1.0.6 xts_0.13.2 dunn.test_1.3.5
## [31] colorspace_2.1-0 paletteer_1.6.0 scales_1.3.0
## [34] MASS_7.3-60 zeallot_0.1.0 insight_0.19.10
## [37] cli_3.6.2 mvtnorm_1.2-4 rmarkdown_2.25
## [40] generics_0.1.3 rstudioapi_0.15.0 tzdb_0.4.0
## [43] parameters_0.21.6 cachem_1.0.8 effectsize_0.8.7
## [46] vctrs_0.6.5 boot_1.3-30 jsonlite_1.8.8
## [49] hms_1.1.3 patchwork_1.2.0 ggrepel_0.9.5
## [52] correlation_0.8.4 BWStest_0.2.3 jquerylib_0.1.4
## [55] PMCMRplus_1.9.10 quantmod_0.4.25 glue_1.6.2
## [58] rematch2_2.1.2 stringi_1.8.3 gtable_0.3.4
## [61] prismatic_1.1.1 quadprog_1.5-8 gmp_0.7-4
## [64] munsell_0.5.0 pillar_1.9.0 htmltools_0.5.7
## [67] R6_2.5.1 evaluate_0.23 lattice_0.21-9
## [70] highr_0.10 memoise_2.0.1 bslib_0.6.1
## [73] Rcpp_1.0.12 xfun_0.41 kSamples_1.2-10
## [76] zoo_1.8-12 pkgconfig_2.0.3