Background and Introduction


Purpose

This analysis aims to investigate factors associated with salary variability.

It is important to note that the data was collected through an independent online survey. While I had no influence on the study design, question selection, or participant selection process, I dedicated time, initially, to assessing the survey design and identified specific questions and hypotheses to explore.


Specific Questions


(Please note, this version of the report has been condensed for brevity as its purpose is to act as a sample of my analysis abilities. It only features analysis of the relationship between salary and region. This reduction aims to enhance readability while providing a demonstration of my practical abilities)


Data


The dataset is sourced from a public online survey conducted at www.askamanager.org.

The survey creators were interested in salary variability and trends across different dimensions. Participants self-selected to fill out the survey, and their responses are entirely anonymous.

Beginning in 2021, the dataset became “live” and, to this day, it is continually growing as new people respond to the survey. The analysis for this project applies to data sourced on 17.02.2024.

The raw dataset is available here.


Method

This analysis was conducted using the statistical computing environment, R.

Initially, the raw data frame was imported into R as a CSV file and saved as an object called raw_data.

The analysis proceeded through the following stages:


Reporting

In addition to presenting the analysis results, this report includes a comprehensive account of all R code used for reproducibility and clarity.

For a concise and visual summary of the most notable results only, please refer to this graphical dashboard


Data Preparation and Tidying


Import CSV file as raw_data, and define a new version, salary_data, for analysis.

raw_data <- read.csv("/Users/jameslauder/Desktop/Data Learning/Data Sets/Ask A Manager Salary Survey 2021 (Personal Copy - est. 106.02.24)) - Form Responses 1.csv")
salary_data <- raw_data


Rename variables to more compact names.

column_names = c("timestamp", "age_group", "industry", "job_title", "job_title_context", "salary", "compensation", "currency", "other_currency", "income_context", "country", "us_state", "city", "work_exp_overall", "work_exp_field", "education", "gender", "race")
colnames(salary_data) = column_names


Append a unique ID variable to dataframe.

id <- seq(1,nrow(salary_data))
salary_data <- cbind(id, salary_data)


Check initial structure of data frame.

glimpse(salary_data)
## Rows: 28,014
## Columns: 19
## $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ timestamp         <chr> "4/27/2021 11:02:10", "4/27/2021 11:02:22", "4/27/20…
## $ age_group         <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-34"…
## $ industry          <chr> "Education (Higher Education)", "Computing or Tech",…
## $ job_title         <chr> "Research and Instruction Librarian", "Change & Inte…
## $ job_title_context <chr> "", "", "", "", "", "", "", "High school, FT", "Data…
## $ salary            <chr> "55,000", "54,600", "34,000", "62,000", "60,000", "6…
## $ compensation      <int> 0, 4000, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, …
## $ currency          <chr> "USD", "GBP", "USD", "USD", "USD", "USD", "USD", "US…
## $ other_currency    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ income_context    <chr> "", "", "", "", "", "", "", "", "", "I work for a Ch…
## $ country           <chr> "United States", "United Kingdom", "US", "USA", "US"…
## $ us_state          <chr> "Massachusetts", "", "Tennessee", "Wisconsin", "Sout…
## $ city              <chr> "Boston", "Cambridge", "Chattanooga", "Milwaukee", "…
## $ work_exp_overall  <chr> "5-7 years", "8 - 10 years", "2 - 4 years", "8 - 10 …
## $ work_exp_field    <chr> "5-7 years", "5-7 years", "2 - 4 years", "5-7 years"…
## $ education         <chr> "Master's degree", "College degree", "College degree…
## $ gender            <chr> "Woman", "Non-binary", "Woman", "Woman", "Woman", "M…
## $ race              <chr> "White", "White", "White", "White", "White", "White"…


Duplicated Values


Full Duplicates

#search for full duplicates (disregard 'id' variable)
sum(duplicated(salary_data[,-1]))
## [1] 0

No full duplicates present. Most likely due to every entry having a different time-stamp at least.


Partial Duplicates

Duplicates after disregarding: id, timestamp, job_title_context, income_context (essentially true duplicates still).

variable_vec <- c("age_group", "industry", "job_title", "compensation", "currency", "other_currency", "country", "us_state", "city", "work_exp_overall", "work_exp_field", "education", "gender", "race", "salary")
data_dup_logical <- duplicated(salary_data[variable_vec]) | duplicated(salary_data[variable_vec], fromLast = TRUE)


paste(sum(data_dup_logical), "partial duplicates?")
## [1] "292 partial duplicates?"


Observe the partial duplicates as a new data frame, data_dup (large output - excluded from report).

data_dup <- salary_data[data_dup_logical,]
head(data_dup)


Sort duplicates dataframe by a series of variables, for easier comparison. Truncate long values. Observe first 20 rows only, for convenience.

#sort dataframe by a series of variables
data_dup_sorted <- data_dup %>%
  arrange(age_group, industry, job_title, compensation, currency, other_currency, country, 
          us_state, city, work_exp_overall, work_exp_field, education, gender, race, salary)

#truncate long values to make easier to read data frame
max_length = 10
data_dup_sorted_tr <- data_dup_sorted %>%
  mutate_at(c("age_group", "industry", "job_title", "job_title_context","compensation", "currency", 
              "other_currency", "income_context", "country", "us_state", "city", "work_exp_overall", 
              "work_exp_field", "education", "gender", "race", "salary"), ~substr(., 1, max_length))

data_dup_sorted_tr[1:20,]
##       id          timestamp age_group   industry  job_title job_title_context
## 1  16604 4/28/2021 18:15:06     18-24 Business o Certified                   
## 2  16615 4/28/2021 18:16:06     18-24 Business o Certified                   
## 3  19523 4/29/2021 10:06:16     18-24 Computing  Software e        Rotational
## 4  19526 4/29/2021 10:07:16     18-24 Computing  Software e        Rotational
## 5  14476 4/28/2021 14:46:47     18-24 Engineerin Structural                  
## 6  14498 4/28/2021 14:50:20     18-24 Engineerin Structural                  
## 7  19280  4/29/2021 8:42:29     18-24        Law  Paralegal                  
## 8  19288  4/29/2021 8:44:21     18-24        Law  Paralegal                  
## 9    416 4/27/2021 11:09:38     18-24 Law Enforc Library As                  
## 10  1770 4/27/2021 11:31:35     18-24 Law Enforc Library As                  
## 11 25341  5/6/2021 13:02:25     18-24 Nonprofits Program Co                  
## 12 25342  5/6/2021 13:02:38     18-24 Nonprofits Program Co                  
## 13 14748 4/28/2021 15:23:15     18-24 Nonprofits Senior Res                  
## 14 14755 4/28/2021 15:23:58     18-24 Nonprofits Senior Res                  
## 15 18243 4/28/2021 23:18:11     18-24     Retail Area sales        Children’s
## 16 18255 4/28/2021 23:22:53     18-24     Retail Area sales        Department
## 17 15541 4/28/2021 17:10:31     18-24 Utilities  Policy Ana                  
## 18 15564 4/28/2021 17:11:12     18-24 Utilities  Policy Ana                  
## 19 26777  9/19/2021 0:37:02     25-34 Accounting Chief of S                  
## 20 26778  9/19/2021 0:38:32     25-34 Accounting Chief of S                  
##     salary compensation currency other_currency income_context    country
## 1   63,000        45000      USD                                      USA
## 2   63,000        45000      USD                                      USA
## 3  118,000         <NA>      USD                               United Sta
## 4  118,000         <NA>      USD                               United Sta
## 5   75,600            0      USD                                     USA 
## 6   75,600            0      USD                                     USA 
## 7   25,000         <NA>      GBP                               United Kin
## 8   25,000         <NA>      GBP                               United Kin
## 9   49,000         1000      USD                               United Sta
## 10  49,000         1000      USD                               United Sta
## 11   46425         <NA>      USD                                      USA
## 12   46425         <NA>      USD                                      USA
## 13  49,500         <NA>      USD                               United Sta
## 14  49,500         <NA>      USD                               United Sta
## 15  49,000         <NA>      USD                    Work typic United Sta
## 16  49,000         <NA>      USD                    Typically  United Sta
## 17  57,000         1100      USD                                       US
## 18  57,000         1100      USD                                       US
## 19   70000         <NA>      USD                               United Sta
## 20   70000         <NA>      USD                               United Sta
##      us_state       city work_exp_overall work_exp_field  education gender
## 1   Minnesota Minneapoli        5-7 years     2 - 4 year Master's d    Man
## 2   Minnesota Minneapoli        5-7 years     2 - 4 year Master's d    Man
## 3    New York   New York       2 - 4 year     1 year or  College de  Woman
## 4    New York   New York       2 - 4 year     1 year or  College de  Woman
## 5    New York New York c       2 - 4 year     2 - 4 year Master's d  Woman
## 6    New York New York c       2 - 4 year     2 - 4 year Master's d  Woman
## 7                Bristol        5-7 years     2 - 4 year High Schoo  Woman
## 8                Bristol        5-7 years     2 - 4 year High Schoo  Woman
## 9  Massachuse     Boston       2 - 4 year     2 - 4 year College de  Woman
## 10 Massachuse     Boston       2 - 4 year     2 - 4 year College de  Woman
## 11   New York     Albany       2 - 4 year     1 year or  College de  Woman
## 12   New York     Albany       2 - 4 year     1 year or  College de  Woman
## 13   Maryland   Bethesda        5-7 years     2 - 4 year College de  Woman
## 14   Maryland   Bethesda        5-7 years     2 - 4 year College de  Woman
## 15      Texas     Austin       1 year or      1 year or  College de  Woman
## 16      Texas     Austin       1 year or      1 year or  College de  Woman
## 17   Virginia Charlottes       2 - 4 year     2 - 4 year College de  Woman
## 18   Virginia Charlottes       2 - 4 year     2 - 4 year College de  Woman
## 19   Kentucky Louisville       8 - 10 yea      5-7 years Master's d  Woman
## 20   Kentucky Louisville       8 - 10 yea      5-7 years Master's d  Woman
##          race
## 1  Asian or A
## 2  Asian or A
## 3       White
## 4       White
## 5       White
## 6       White
## 7       White
## 8       White
## 9       White
## 10      White
## 11 Asian or A
## 12 Asian or A
## 13 Black or A
## 14 Black or A
## 15      White
## 16      White
## 17 Hispanic, 
## 18 Hispanic, 
## 19      White
## 20      White


We notice a very interesting trend in that the majority, if not all duplicates, occurred within very close duration of each other (mostly within a matter of minutes or hours), indicating they are indeed most likely legitimate duplicates from the same respective person each time.

I observed every single one of the duplicates and concluded that they all require removal (153 removals total (0.55% of data)).

We call our new data frame: salary_data2

salary_data2 <- salary_data %>%
  distinct(age_group, industry, job_title, compensation, currency, other_currency, 
           country, us_state, city, work_exp_overall, work_exp_field, education, 
           race, gender, salary, .keep_all = TRUE)


Missing Values - Initial Check


Check number of NA values present for each variable.

sapply(salary_data2, function(x){sum(is.na(x))})
##                id         timestamp         age_group          industry 
##                 0                 0                 0                 0 
##         job_title job_title_context            salary      compensation 
##                 0                 1                 0              7235 
##          currency    other_currency    income_context           country 
##                 0                 3                 2                 0 
##          us_state              city  work_exp_overall    work_exp_field 
##                 0                16                 0                 0 
##         education            gender              race 
##                 0                 0                 0


Check number of blank (whitespace) values present for each variable.

sapply(salary_data2, function(x){sum(grepl("^\\s*$", x))})
##                id         timestamp         age_group          industry 
##                 0                 0                 0                73 
##         job_title job_title_context            salary      compensation 
##                 0             20661                 0                 0 
##          currency    other_currency    income_context           country 
##                 0             27652             24831                 0 
##          us_state              city  work_exp_overall    work_exp_field 
##              4985                 8                 0                 0 
##         education            gender              race 
##               214               168               171


Format Variables


Format timestamp Variable


Format values so that they are recognised as class: ‘POSIXct’.

salary_data2$timestamp <- as.POSIXct(salary_data2$timestamp, format = "%m/%d/%Y %H:%M:%S")


Create a new vector, date, with only the date (no time), and bind to beginning of salary_data2.

date <- as.Date(salary_data2$timestamp, format = "%m/%d/%Y")

salary_data2 <- salary_data2 %>%
  mutate(date = date) %>%
  relocate(date, .before = timestamp)


Format age_group Variable


Convert to factor then check for missing values (NAs or white space).

#convert to ordered factor then view contingency table
salary_data2$age_group <- factor(salary_data2$age_group, ordered = TRUE, levels = c("under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"))
table(salary_data2$age_group)
## 
##   under 18      18-24      25-34      35-44      45-54      55-64 65 or over 
##         13       1197      12558       9845       3170        984         94
#confirm there are no NAs or blanks
nas <- sum(is.na(salary_data2$age_group))
blanks <- sum(grepl("^\\s*$" ,salary_data2$age_group))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 0NULL


Format industry Variable


Convert to factor then check for missing values (NAs or white space).

#check whether there are NAs or blanks
salary_data2$industry <- as.factor(salary_data2$industry)
nas <- sum(is.na(salary_data2$industry))
blanks <- sum(grepl("^\\s*$" ,salary_data2$industry))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 73NULL


73 blank values present. Convert these to ‘Unspecified’ values.

salary_data2$industry <- gsub("^\\s*$", "Unspecified", salary_data2$industry, ignore.case = TRUE)

#convert back to factor
salary_data2$industry <- as.factor(salary_data2$industry)


Observe the number of levels for industry.

length(levels(salary_data2$industry))
## [1] 1218


Collapsing of values required to drastically reduce number of levels. Keep the 40 most frequent responses and collapse rest into ‘other’ category.

Observe the existing factor levels arranged by frequency (display top 40).

#set number of levels to retain
industry_levels = 40

#collapse the remainder into singular level called 'other'
collapsed_factor <- fct_lump_n(salary_data2$industry, industry_levels)
collapsed_factor_df <- data.frame(collapsed_factor)
collapsed_factor_df %>% count(collapsed_factor) %>% arrange(desc(n))
##                        collapsed_factor    n
## 1                     Computing or Tech 4662
## 2          Education (Higher Education) 2454
## 3                            Nonprofits 2401
## 4                           Health care 1881
## 5  Government and Public Administration 1879
## 6         Accounting, Banking & Finance 1788
## 7                                 Other 1741
## 8          Engineering or Manufacturing 1686
## 9           Marketing, Advertising & PR 1118
## 10                                  Law 1093
## 11               Business or Consulting  842
## 12        Education (Primary/Secondary)  830
## 13                      Media & Digital  769
## 14                            Insurance  526
## 15                               Retail  500
## 16                    Recruitment or HR  459
## 17             Property or Construction  384
## 18                         Art & Design  353
## 19       Utilities & Telecommunications  352
## 20               Transport or Logistics  301
## 21                                Sales  285
## 22                          Social Work  272
## 23                 Hospitality & Events  260
## 24                        Entertainment  252
## 25              Agriculture or Forestry  138
## 26             Leisure, Sport & Tourism   97
## 27                          Unspecified   73
## 28                           Publishing   56
## 29                              Library   51
## 30                            Libraries   50
## 31                              Biotech   49
## 32           Law Enforcement & Security   42
## 33                       Public Library   35
## 34                             Research   33
## 35                        Manufacturing   26
## 36                      Pharmaceuticals   24
## 37                         Architecture   21
## 38                          Real Estate   21
## 39                       Pharmaceutical   20
## 40                     Public Libraries   20
## 41                               Pharma   17


The following new levels were established, based on an iterative process of combining equivalent levels (essentially the same responses but spelled or formatted differently) and then reassessing contingency table of levels (and making target searches for key words using the code below),

#Archived example code used to find string matches and determine their relative frequency
#matches = grepl("manufac", temp$industry, ignore.case = TRUE)
#temp$industry[matches]

#code used to assign new factor levels
salary_data2$industry <- gsub(".*Librar.*", "Library", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Biotech.*", "Pharmaceutical and Biotechnology", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Pharm.*", "Pharmaceutical and Biotechnology", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Research.*", "Research", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Manufac.*", "Engineering or Manufacturing", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Sci.*", "Sciences", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Estate.*", "Property or Construction", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Vet.*", "Veterinary", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Politic.*", "Politics", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Enviro.*", "Environmental Related", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Oil.*", "Oil and Gas", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Energy.*", "Energy", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Auto.*", "Automotive", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Architect.*", "Architecture", salary_data2$industry, ignore.case = TRUE)
salary_data2$industry <- gsub(".*Food.*", "Food and Beverage", salary_data2$industry, ignore.case = TRUE)


In the end, we retain 40 distinct levels and collapse the rest to make ‘Other’ category, resulting in 41 levels total.

industry_levels = 40
salary_data2$industry <- fct_lump_n(salary_data2$industry, industry_levels)


Format country Variable


Convert to factor then check for missing values (NAs or white space).

#confirm there are no NAs or blanks
salary_data2$country <- as.factor(salary_data2$country)
nas <- sum(is.na(salary_data2$country))
blanks <- sum(grepl("^\\s*$" ,salary_data2$country))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 0NULL


Observe the number of levels for country.

length(levels(salary_data2$country))
## [1] 373


A lot of formatting required, due to inconsistent entry of country names. Get a sense for data by observing the existing factor levels arranged by frequency (display top 40).

#set number of levels to retain before collapsing the remainder into singular level called 'other'
country_levels = 40
collapsed_factor <- fct_lump_n(salary_data2$country, country_levels)
collapsed_factor_df <- data.frame(collapsed_factor)
collapsed_factor_df %>% count(collapsed_factor) %>% arrange(desc(n))
##             collapsed_factor    n
## 1              United States 8909
## 2                        USA 7906
## 3                         US 2591
## 4                     Canada 1560
## 5                      Other  907
## 6             United States   660
## 7                       U.S.  577
## 8                         UK  574
## 9             United Kingdom  542
## 10                      USA   463
## 11                       Usa  445
## 12  United States of America  425
## 13                 Australia  316
## 14             United states  206
## 15                       usa  181
## 16                   Germany  172
## 17                   England  133
## 18             united states  116
## 19                        Us  104
## 20                   Ireland  102
## 21               New Zealand   93
## 22                        Uk   85
## 23                   Canada    75
## 24                Australia    68
## 25           United Kingdom    65
## 26                    France   63
## 27                    U.S.A.   46
## 28 United States of America    44
## 29               Netherlands   43
## 30                     Spain   43
## 31                  Scotland   40
## 32                        us   37
## 33                    Sweden   34
## 34                   Belgium   33
## 35                  England    32
## 36               Switzerland   29
## 37                    canada   27
## 38                     Japan   26
## 39           The Netherlands   25
## 40                       UK    22
## 41                   America   21
## 42                       U.S   21


The following new levels were established, based on an iterative process of combining equivalent levels, with different formats, and then reassessing contingency table of levels (and making target searches for key words using the code below).

Note that there are still many values that remain unformatted, as it simply would take too long to catch every single one.

We leave their values as is, for now, rather than lumping into an ‘Other’ category.

#Example Ccde used to find string matches and determine their relative frequency
#matches = grepl("unite", salary_data2$country, ignore.case = TRUE)
#salary_data2$country[matches]

#Code used to assign new factor levels
salary_data2$country <- gsub(".*UAE.*|.*Emirates.*", "UAE", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*UK.*|.*U.K.*|.*United K.*|.*Great Bri.*|.*England.*|.*Ireland.*|.*Scotland.*|.*Wales.*", "UK", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*US.*|.*U.S.*|.*U. S.*|.*Unite.*", "USA", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Canada.*", "Canada", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Netherlands.*", "Netherlands", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*New Ze.*|.*NZ.*", "New Zealand", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*German.*", "Germany", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Sweden.*", "Sweden", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Switzer.*", "Switzerland", salary_data2$country, ignore.case = TRUE)
salary_data2$country <- gsub(".*Denmar.*", "Denmark", salary_data2$country, ignore.case = TRUE)

#restore 'country' to factor class again
salary_data2$country <- as.factor(salary_data2$country)


Observe the resulting factor levels (most frequent only).

salary_data2 %>% group_by(country) %>% count(country) %>% arrange(desc(n))
## # A tibble: 176 × 2
## # Groups:   country [176]
##    country         n
##    <fct>       <int>
##  1 USA         23401
##  2 UK           1714
##  3 Canada       1668
##  4 Germany       195
##  5 New Zealand   129
##  6 Netherlands    88
##  7 France         63
##  8 Spain          43
##  9 Sweden         41
## 10 Switzerland    38
## # ℹ 166 more rows


Establish New Variable: region


Due to the large number of levels still remaining for the country variable we establish a region variable, by collapsing countries together. This may be more convenient for analysis.

library(countrycode)

#establish a new vector for countries
countries <- salary_data2$country

#convert countries to their associated world region, using, 'countrycode' package. Store in new vector called region
region <- as.factor(countrycode(countries, "country.name", "region23"))

#collapse the Central America and Africa regions, respectively, as there are not enough entries in the seperated groups
region <- gsub("Caribbean|Central America|South America", "Central America, South America & Caribbean", region, ignore.case = TRUE)
region <- gsub(".*Africa.*", "Africa", region, ignore.case = TRUE)

#Join the resulting region vector to our data frame
salary_data2 <- salary_data2 %>%
  mutate(region = as.factor(region)) %>%
  relocate(region, .after = country)


Observe the resulting contingency table for region.

(note: 80 remaining countries, as NA values, that could not be grouped due to their messy format).

salary_data2 %>%
  group_by(region) %>%
  count(region) %>%
  arrange(desc(n))
## # A tibble: 13 × 2
## # Groups:   region [13]
##    region                                         n
##    <fct>                                      <int>
##  1 Northern America                           25071
##  2 Northern Europe                             1819
##  3 Western Europe                               426
##  4 Australia and New Zealand                    129
##  5 Southern Europe                               80
##  6 <NA>                                          80
##  7 South-Eastern Asia                            51
##  8 Eastern Asia                                  49
##  9 Central America, South America & Caribbean    42
## 10 Africa                                        37
## 11 Southern Asia                                 30
## 12 Eastern Europe                                25
## 13 Western Asia                                  22


Format us_state Variable


Convert to factor then check for missing values (NAs or white space).

#confirm there are no NAs or blanks
salary_data2$us_state <- as.factor(salary_data2$us_state)
nas <- sum(is.na(salary_data2$us_state))
blanks <- sum(grepl("^\\s*$" , salary_data2$us_state))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 4985NULL


A lot of blanks present, presumably due to many people not living in USA.

Observe the number of levels for us_state.

length(levels(salary_data2$us_state))
## [1] 133


133 different levels present, due to the addition of state combinations produced from people selecting two or more states.


Format city Variable


Convert to factor then check for missing values (NAs or white space).

#confirm there are no NAs or blanks
salary_data2$city <- as.factor(salary_data2$city)
nas <- sum(is.na(salary_data2$city))
blanks <- sum(grepl("^\\s*$" , salary_data2$city))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 16     blanks: 8NULL


A mixture of NAs and blanks present. Observe the number of levels for city.

length(levels(salary_data2$city))
## [1] 4816


A huge amount of levels for city as it is a free text entry. Simply too much formatting required to clean this variable up to a usable state. If needed we can instead target specific entries within this variable.


Format work_exp_overall and work_exp_field Variables


Convert to ordered factors then confirm whether there are no missing values (NAs or white space).

salary_data2$work_exp_overall <- factor(salary_data2$work_exp_overall, ordered = TRUE, levels = c("1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years", "21 - 30 years", "31 - 40 years", "41 years or more" ))

salary_data2$work_exp_field <- factor(salary_data2$work_exp_field, ordered = TRUE, levels = c("1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years", "21 - 30 years", "31 - 40 years", "41 years or more" ))

nas <- sum(is.na(salary_data2$work_exp_overall))
blanks <- sum(grepl("^\\s*$" , salary_data2$work_exp_overall))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 0NULL
nas <- sum(is.na(salary_data2$work_exp_field))
blanks <- sum(grepl("^\\s*$" , salary_data2$work_exp_field))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 0NULL


No missing values present for either factor.


Format education and gender Variables


Education

Convert education to factor then observe whether there are any missing values (NAs or white space)

salary_data2$education <- as.factor(salary_data2$education)

nas <- sum(is.na(salary_data2$education))
blanks <- sum(grepl("^\\s*$" , salary_data2$education))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 214NULL


214 blank values. Assume these are people with education levels below that of our lowest option, ‘High school’. Convert these values to ‘None’ instead.

#convert blanks to "None"
salary_data2 <- salary_data2 %>%
  mutate(education = if_else(education == "", "None", as.factor(education)))

#finally, convert to ordered factor
salary_data2$education <- factor(salary_data2$education, ordered = TRUE, levels = c("None", "High School", "Some college", "College degree", "Master's degree", "PhD", "Professional degree (MD, JD, etc.)"))


Observe the contingency table.

salary_data2 %>%
  group_by(education) %>%
  count(education) %>%
  arrange(desc(n))
## # A tibble: 7 × 2
## # Groups:   education [7]
##   education                              n
##   <ord>                              <int>
## 1 College degree                     13429
## 2 Master's degree                     8808
## 3 Some college                        2043
## 4 PhD                                 1419
## 5 Professional degree (MD, JD, etc.)  1317
## 6 High School                          631
## 7 None                                 214


Gender

Convert gender to factor then observe whether there are any missing values (NAs or white space)

#confirm there are no NAs or blanks
salary_data2$gender <- as.factor(salary_data2$gender)

nas <- sum(is.na(salary_data2$gender))
blanks <- sum(grepl("^\\s*$" , salary_data2$gender))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 168NULL


168 blank values. Collapse this category into the existing category, ‘Other or prefer not to answer’

salary_data2 <- salary_data2 %>%
  mutate(gender = if_else(gender %in% c("", "Prefer not to answer"), "Other or prefer not to answer", as.character(gender)))

#finally, convert to factor
desired_order <- c("Woman", "Man", "Non-binary", "Other or prefer not to answer")
salary_data2$gender <- factor(salary_data2$gender, levels = desired_order)


Observe the contingency table.

salary_data2 %>%
  group_by(gender) %>%
  count(gender) %>%
  arrange(desc(n))
## # A tibble: 4 × 2
## # Groups:   gender [4]
##   gender                            n
##   <fct>                         <int>
## 1 Woman                         21235
## 2 Man                            5426
## 3 Non-binary                      738
## 4 Other or prefer not to answer   462


Format race Variable


Convert race to factor then observe whether there are any missing values (NAs or white space)

salary_data2$race <- as.factor(salary_data2$race)

nas <- sum(is.na(salary_data2$race))
blanks <- sum(grepl("^\\s*$" , salary_data2$race))
print(cat("NAs:", nas, "   ", "blanks:", blanks))
## NAs: 0     blanks: 171NULL


171 blank values. Combine them with the category: ‘Another option not listed here or prefer not to answer’

salary_data2 <- salary_data2 %>%
  mutate(race = if_else(race == "", "Another option not listed here or prefer not to answer", as.character(race)))

salary_data2$race <- as.factor(salary_data2$race)


50 different levels for race present, due to the multitude of combinations. Assign a new variable: race_main, which collapses the less frequent levels into ‘Other’ category, while keeping the most frequent 13.

#create new vector
race_levels = 13
race_main <- fct_lump_n(salary_data2$race, race_levels)

#join with data frame
salary_data2 <- salary_data2 %>%
  mutate(race_main = race_main, .after = race)


Observe the contingency table

salary_data2 %>%
  group_by(race_main) %>%
  count(race_main) %>%
  arrange(desc(n))
## # A tibble: 14 × 2
## # Groups:   race_main [14]
##    race_main                                                         n
##    <fct>                                                         <int>
##  1 White                                                         23071
##  2 Asian or Asian American                                        1384
##  3 Another option not listed here or prefer not to answer          790
##  4 Black or African American                                       672
##  5 Hispanic, Latino, or Spanish origin                             596
##  6 Hispanic, Latino, or Spanish origin, White                      382
##  7 Asian or Asian American, White                                  343
##  8 Other                                                           174
##  9 Black or African American, White                                123
## 10 Middle Eastern or Northern African, White                        82
## 11 Middle Eastern or Northern African                               69
## 12 Native American or Alaska Native, White                          67
## 13 White, Another option not listed here or prefer not to answer    65
## 14 Native American or Alaska Native                                 43


Format salary Variable


Search for blank values (consisting of zero or more white space characters) present in salary.

sum(grepl("^\\s*$", salary_data2$salary))
## [1] 0


Remove “,” characters from salary values and convert variable to numeric type.

salary_data2$salary <- as.numeric(gsub(",", "", salary_data2$salary))


No NAs present in numeric form of salary.

sum(is.na(salary_data2$salary))
## [1] 0


Format currency Variable


Convert to factor then observe the distribution.

salary_data2 <- salary_data2 %>%
  mutate(currency = as.factor(currency))

salary_data2 %>%
  group_by(currency) %>%
  count(currency) %>%
  arrange(desc(n))
## # A tibble: 11 × 2
## # Groups:   currency [11]
##    currency     n
##    <fct>    <int>
##  1 USD      23205
##  2 CAD       1661
##  3 GBP       1584
##  4 EUR        638
##  5 AUD/NZD    499
##  6 Other      157
##  7 CHF         37
##  8 SEK         37
##  9 JPY         23
## 10 ZAR         16
## 11 HKD          4


A wide range of currencies present.

To proceed with the analysis, we need to standardize salary by converting all values to USD.


Establish new variable salary_USD


Use the relevant exchange rate from the specific date of each respective survey entry.

Historical exchange rate data was sourced from Reserve Bank of New Zealand website, and uploaded as a CSV file.

The data dates back to January 2018, which covers the duration of this survey.

After importing the data, some initial formatting takes place.

exchange_data <- read.csv("/Users/jameslauder/Desktop/Data Learning/Personal Projects/Salary Survey Analysis/Exchange rates.csv")

exchange_data2 <- exchange_data %>%

    #Rename variables
  setNames(c("date", "TWI", "USD", "GBP", "AUS", "JPY", "EUR", "CAD", "KRW", "CNY", "MYR", "HKD", "IDR", "THB", "SGD", "TWD", "INR", "PHP", "VND", "his_TWI")) %>%

    #format date column
  mutate(date = as.Date(date, format = "%d %b %Y")) %>%

    #remove commas from IDR and VND columns
  mutate_at(vars(13:19), ~gsub(",", "", .)) %>%

    #format currency conversion columns
  mutate_at(vars(2:20), as.numeric) %>%

    #Trim first four redundant rows off
  slice(5:n())


Join the new exchange_data2 data frame with the survey_data2 data frame, using a left join, with date as the key variable

# Perform a Left join, by date, to align the relevant daily exchange rates with each survey entry
salary_data2_join <- salary_data2 %>%
  left_join(exchange_data2, by = "date")


Note that there are multiple survey entries with no daily exchange rate data attached, due to their being no data for these specific days in our exchange rate data frame.

Remedy this by using the exchange rate data for the next available date.

#set initial parameters
#index for 1st variable to fill
start = which(colnames(salary_data2_join) == "TWI")

#index for last variable to fill
end = which(colnames(salary_data2_join) == "his_TWI")

#vector of the variables to fill
cols <- colnames(salary_data2_join)[start:end]

#vector of indeces of all the NA values (note all other variables have identical NA pattern to 'USD')
NAs <- which(is.na(salary_data2_join$USD))

#vector of indeces of not NA values
not_NAs <- which(!is.na(salary_data2_join$USD))

#Run a for loop which does the following: 
#for each concerned variable
for(i in 1:length(cols)){
  colname = cols[i]
  
  #for every observation with an NA
  for(j in NAs){
    
    #assign the index of the next true value to use as a replacement
    next_value = min(not_NAs[not_NAs > i])
    
    #Perform the replacement
    salary_data2_join[[colname]][j] = salary_data2_join[[colname]][next_value]
  }
}


There are a few relevant currencies which the data_exchanges2 data frame did not have data on.

Bind additional columns on for those currencies. Use a more crude method—averaged out exchange rate, over a relevant period.

salary_data2_join <- salary_data2_join %>%
  mutate(NZD = 1, .after = VND) %>%
  mutate(CHF = 0.6465, .after = VND) %>%
  mutate(ZAR = 10.4566, .after = VND) %>%
  mutate(SEK = 6.04, .after = VND)


Create a new variable called salary_USD with all salary values converted from native currency to USD.

Perform the salary conversions using a for loop and if/else statements.

Note that the exchange rate data only includes the exchange rates between various countries and NZD, therefore first convert all currencies to NZD, then subsequently convert to USD.

#Create empty vector to be populated with converted USD salary values
salary_USD = numeric(nrow(salary_data2_join))

#for every observation
for(i in 1:nrow(salary_data2_join)){
  
  #if currency value is not "other"
  if(salary_data2_join[i,"currency"] != "Other"){
    
    #extract the type of currency
    currency_value <- as.character(salary_data2_join[i,"currency"])
  
   #Convert any values of "AUD/NZD" to "NZD" so that we can then access that  specific column's exchange rate (note we do not actually have any Australian entries, interestingly)
    if(currency_value == "AUD/NZD"){
      currency_value = "NZD"
    }
   
    #reference the applicable exchange rate by using the 'currency' value to select the appropriate column
    exchange_value <- salary_data2_join[[currency_value]][i]
  
    #extract the salary value for this observation
    salary <- salary_data2_join[i,"salary"]

    #perform calculation to convert native salary to USD
    salary_USD[i] <- (salary / exchange_value) * salary_data2_join[i,"USD"]
  
  #otherwise, if "Other", #set value to NA for now
  }else{
    salary_USD[i] = NA
  }
}

#round the resulting variable and bind to initial data frame
salary_data2 <- salary_data2 %>%
  mutate(salary_USD = round(salary_USD)) %>%
  relocate(salary_USD, .after = salary)



Univariate Analysis


salary_USD


Exploratory Analysis


Initial Observations and Outlier Treatment

Observe a summary of salary_USD quantiles.

summary(salary_data2$salary_USD)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0     54015     75000     92117    107000 102000000       157


Boxplot assessment.

boxplot(salary_data2$salary_USD, main = "Distribution of salary_USD", ylab = "salary_USD")

One extremely large value affecting the visualization.

Observe the largest values to gain some more perspective.

#Observe the top 20 values of the data frame
salary_data2 %>% 
  select(salary_USD, currency, country, job_title) %>%
  mutate(job_title = substr(job_title, 1, 40)) %>%
  arrange(desc(salary_USD)) %>%
  slice_head(n = 20)
##    salary_USD currency   country                                job_title
## 1   102000000      USD  Colombia                       Operations Manager
## 2     5000044      USD       USA                     Inside sales manager
## 3     3000000      USD       USA                           Owner and CEO 
## 4     2111538      USD Singapore                          Product Manager
## 5     1900000      USD       USA Attending Physician (general internal me
## 6     1650000      USD       USA              Principal Software Engineer
## 7     1334782      USD       USA                    Senior Policy Advisor
## 8     1318422      GBP        UK                               Consultant
## 9     1300000      USD       USA                                  Partner
## 10    1263782      GBP        UK Agricultural Supply line Negotiating con
## 11    1260000      USD       USA                        Senior Consultant
## 12    1250000      USD       USA                        Marketing Manager
## 13    1200000      USD       USA                                  Partner
## 14    1126141      CHF     India                   Sr. Research associate
## 15    1100000      USD       USA                        Software Engineer
## 16    1100000      USD       USA                                  Partner
## 17    1026691      EUR   Germany                           Lead developer
## 18     956203      GBP   Norway                                 Purchaser
## 19     954000      USD       USA             Business Systems Analyst III
## 20     950000      USD       USA           Director, Software Development


Very large values present. Assume there is a mistake with the currency value for the unrealistic top salary value of $102,000,000. Respondent may have mistakingly selected USD. Therefore, currency conversion algorithm has not occurred. Manually convert from Peso to USD using relevant exchange rate. Resulting value of US$25500 is much more realistic.

salary_data2 <- salary_data2 %>%
  mutate(salary_USD = ifelse(salary_USD == 102000000, salary_USD * 0.00025, salary_USD))


While addressing unrealistic values, perform a quick initial check of lowest values.

salary_data2 %>%
  select(salary_USD, currency, country, job_title) %>%
  mutate(job_title = substr(job_title, 1, 40)) %>%
  filter(salary_USD <= 1000) %>%
  arrange(salary_USD)
##     salary_USD currency     country                                job_title
## 1            0      USD         USA                                 "mum" ;)
## 2            0      USD         USA                      Executive Director 
## 3            0      USD         USA                                 Attorney
## 4            0      USD         USA                          Student teacher
## 5            0      USD         USA                         Product Marketer
## 6            0      USD         USA                             Househusband
## 7            0      USD         USA                                  Founder
## 8            0      USD         USA                              Unemployed 
## 9            0      USD         USA            Government Relations Director
## 10           0      USD         USA                                  Realtor
## 11           0      USD          na                                       na
## 12           0      USD         USA                           College Senior
## 13           0      USD         USA                                  Student
## 14           0      USD         USA                               Homemaker 
## 15           0      USD         USA                     data science student
## 16           0      USD         USA                                      N/A
## 17           1      USD         USA                Software Development Lead
## 18           3      ZAR       spain                 chief executive official
## 19           4      USD          ss                                     sdsd
## 20          15      USD      dbfemf                                    mn jj
## 21          22      EUR Netherlands                               Programmer
## 22          35      USD         USA                       Operations Manager
## 23          36      EUR       spain                                  Auditor
## 24          36      USD         USA                    Marketing Coordinator
## 25          38      USD         USA                Special Education Teacher
## 26          39      GBP          UK                          Media executive
## 27          40      EUR     Germany        Project Manager / Account Manager
## 28          40      USD         USA                      Assistant Registrar
## 29          40      USD         USA              HR Administrator/Generalist
## 30          40      USD         USA                Community Project Manager
## 31          40      USD         USA                        Territory Manager
## 32          42      EUR     finland                   occupational therapist
## 33          42      CAD      Canada                               Technician
## 34          43      EUR Netherlands                Compliance & Risk Officer
## 35          43      EUR          UK                         Games Specialist
## 36          44      USD         USA                                   Scribe
## 37          45      EUR      France                                     FP&A
## 38          45      USD         USA                        Finance Assistant
## 39          48      EUR       Italy                         Cloud Consultant
## 40          50      USD         USA                       Content Specialist
## 41          52      USD         USA                Digital Media Coordinator
## 42          53      USD         USA                     Chief Fiscal Officer
## 43          54      USD         USA                             Art Director
## 44          55      USD         USA            Personal Executive Assistant 
## 45          55      USD         USA                          Account Manager
## 46          55      USD Switzerland                VM BACKOFFICE COORDINATOR
## 47          55      USD         USA               Talent Acquisition Partner
## 48          55      USD         USA                Customer service manager 
## 49          55      USD         USA               Web Customer Service Rep. 
## 50          55      GBP          UK                    accreditation manager
## 51          56      CAD      Canada                                  Analyst
## 52          56      USD         USA                                    homie
## 53          57      USD         USA                         Program director
## 54          57      USD         USA                       Content Specialist
## 55          58      USD         USA                   Quality Assurance Lead
## 56          60      USD         USA                           Store Manager 
## 57          60      EUR     Germany                       Academic Librarian
## 58          61      USD         USA                          Managing Editor
## 59          62      EUR       spain                          project manager
## 60          65      USD         USA                                  Teacher
## 61          65      USD         USA                           supply Planner
## 62          66      EUR     Germany                       Software Developer
## 63          69      GBP          UK                Head of Business Support 
## 64          70      USD         USA                         Regional Manager
## 65          70      USD         USA          Infection Prevention Associate 
## 66          70      USD     America                   Freelance Art Director
## 67          71      GBP          UK                         Security Analyst
## 68          72      USD         USA                         Technical Writer
## 69          72      EUR Netherlands                        Frontend Engineer
## 70          73      CAD      Canada Director of Prevention and Early Interve
## 71          74      CAD      Canada                                Associate
## 72          76      EUR     Germany            Senior Developer / Consultant
## 73          76      USD         USA                                       vv
## 74          78      USD         USA                                      MSW
## 75          79      EUR     Germany                            Data Manager 
## 76          80      USD         USA                             Facilitator 
## 77          80      USD         USA                       Production editor 
## 78          80      USD         USA                               Job title 
## 79          90      GBP          UK          Senior Softwaare Engineer (Web)
## 80          90      USD         USA                                       RN
## 81          95      CAD      Canada             Change Management Consultant
## 82          97      EUR     Germany                          Strategic Buyer
## 83         100      USD         USA                                 Regional
## 84         105      USD         USA                 Senior Marketing Manager
## 85         105      USD         USA                         Registered Nurse
## 86         108      USD         USA                   Engineering Supervisor
## 87         110      USD         USA                     Health Administrator
## 88         121      EUR     Germany                                      CTO
## 89         121      EUR     Germany                  Senior Support Engineer
## 90         130      USD         USA                     Chief Data Scientist
## 91         130      USD         USA                                    Coach
## 92         145      USD         USA                        Software Engineer
## 93         150      USD         USA                    Senior Data Scientist
## 94         155      USD         USA                                  Manager
## 95         160      USD      Canada                          General manager
## 96         180      USD         USA                  Director of Operations 
## 97         185      USD         USA Senior Director, ESG and Investor Relati
## 98         240      USD         USA                     Chief Data Scientist
## 99         253      JPY       Japan                          Travel designer
## 100        350      USD         USA                                  Counsel
## 101        456      USD    Nigeria                                   Ibterb 
## 102        500      USD       Ghana                           Data scientist
## 103        880      USD       India                             Online Tutor
## 104       1000      USD         USA            Teaching practicum supervisor


There are 104 salary values of $1000 or less, with a number of them being values of $0.

Reasonable to assume that these people were using shorthand and reporting salary in terms of $1000s which is sometimes the norm: i.e $40 really means $40k or $40,000. Country and Job title values for these respondents are not contradictory to this theory.

Transform all of the values below $240 (to be somewhat conservative), by multiplying them by a factor of 1000.

salary_data2 <- salary_data2 %>%
    mutate(salary_USD = ifelse(salary_USD < 240, salary_USD * 1000, salary_USD))


Now observe the smallest values again, this time those salary values below $20,000.

salary_data2 %>%
  select(salary_USD, currency, country, job_title) %>%
  mutate(job_title = substr(job_title, 1, 30)) %>%
  filter(salary_USD <= 20000) %>%
  arrange(salary_USD)
##     salary_USD currency                country                      job_title
## 1            0      USD                    USA                       "mum" ;)
## 2            0      USD                    USA            Executive Director 
## 3            0      USD                    USA                       Attorney
## 4            0      USD                    USA                Student teacher
## 5            0      USD                    USA               Product Marketer
## 6            0      USD                    USA                   Househusband
## 7            0      USD                    USA                        Founder
## 8            0      USD                    USA                    Unemployed 
## 9            0      USD                    USA  Government Relations Director
## 10           0      USD                    USA                        Realtor
## 11           0      USD                     na                             na
## 12           0      USD                    USA                 College Senior
## 13           0      USD                    USA                        Student
## 14           0      USD                    USA                     Homemaker 
## 15           0      USD                    USA           data science student
## 16           0      USD                    USA                            N/A
## 17         240      USD                    USA           Chief Data Scientist
## 18         253      JPY                  Japan                Travel designer
## 19         350      USD                    USA                        Counsel
## 20         456      USD               Nigeria                         Ibterb 
## 21         500      USD                  Ghana                 Data scientist
## 22         880      USD                  India                   Online Tutor
## 23        1000      USD                    USA      Software Development Lead
## 24        1000      USD                    USA  Teaching practicum supervisor
## 25        1442      USD                    USA Associate Team Lead - Revenue 
## 26        1727  AUD/NZD                    USA                  Private Tutor
## 27        2072      ZAR           South Africa                        Manager
## 28        2072      ZAR           South Africa                        Manager
## 29        2126      EUR            Netherlands                      Marketing
## 30        2350      USD                    USA Graduate Research Assistant - 
## 31        2400      USD               Pakistan      Quality Assurance Auditor
## 32        2897      EUR                Belgium Customer Service/Tech Support 
## 33        3000      ZAR                  spain       chief executive official
## 34        3055      GBP                     UK       Library customer advisor
## 35        3200      USD                    USA Registration and enrollment of
## 36        3348      SEK                    USA               Paraprofessional
## 37        3592      EUR                    USA                        Soldier
## 38        4000      USD                    USA            freelance captioner
## 39        4000      USD                    USA               Property Manager
## 40        4000      USD                     ss                           sdsd
## 41        4100      USD                    USA     patient services associate
## 42        4800      USD                    USA Sales Development Representati
## 43        5100      USD                    USA                 Medical Biller
## 44        5100      USD                    USA Assistant Director of Marketin
## 45        5265      USD              Malaysia                              QC
## 46        5700      USD                    USA    Accounts Payable Specialist
## 47        6000      USD                    USA                         server
## 48        6111      GBP                     UK    Clinical Evaluation Manager
## 49        6209      USD                    USA        Senior Research Analyst
## 50        6400      USD                    USA                 Chief of Staff
## 51        6555      EUR                Germany             Narrative Designer
## 52        6600      USD                  India                        Analyst
## 53        7250      USD               Pakistan                  iOS Developer
## 54        7390      USD                Nigeria Senior Analyst, Management Con
## 55        7847      EUR Bosnia and Herzegovina   Marketing Automation Analyst
## 56        8000      USD                    USA            Development Officer
## 57        9000      USD                    USA                  Online Editor
## 58        9160      GBP                     UK          freelance illustrator
## 59        9600      USD                   Cuba              Frontend Engineer
## 60        9658      EUR                  Spain                    Translator 
## 61       10000      USD                    USA                  Sr Consultant
## 62       10000      USD                    USA                        Student
## 63       10382      EUR                Romania           Technology Associate
## 64       10500      USD                    USA              Systems Engineer 
## 65       10616      JPY                  Japan   Compliance / Privacy Manager
## 66       10700      USD                    USA Software Engineer Technical Su
## 67       11000      USD                    USA              software engineer
## 68       11000      USD               Pakistan               Technical Writer
## 69       11100      USD                    USA           Resident's Assistant
## 70       11500      USD                    USA            Assistant Professor
## 71       11800      USD                    USA                       Producer
## 72       12000      USD                    USA                    Bookkkeeper
## 73       12000      USD                    USA International Board Certified 
## 74       12000      USD                    USA               Proposal manager
## 75       12000      USD                    USA  Small Business Administrator 
## 76       12000      USD                    USA                       Engineer
## 77       12000      USD                  India                        Analyst
## 78       12000      USD                    USA               Assistant Editor
## 79       12000      USD                    USA                Music Librarian
## 80       12000      USD                     UK            Senior illustrator 
## 81       12000      USD                    USA               Paraprofessional
## 82       12000      USD                    USA             Associate Attorney
## 83       12000      USD                    USA                             TA
## 84       12093      CAD                 Canada Administrative assistant (part
## 85       12559      GBP                     UK          Chartered accountant 
## 86       12600      USD                    USA             Substitute teacher
## 87       13000      USD                    USA             Restaurant manager
## 88       13000      USD                    USA   Manager of Customer Support 
## 89       13000      USD                    USA          Host/Server Assistant
## 90       13000      USD                    USA                         Server
## 91       13000      USD                    USA                      Associate
## 92       13000      USD                    USA                        Retired
## 93       13000      USD                    USA Transactional (Contracts) Atto
## 94       13045      EUR                 Greece              Software Engineer
## 95       13300      USD                    USA             Executive Director
## 96       13400      USD                    USA             Graduate Assistant
## 97       13500      USD                    USA       Growth Marketing Manager
## 98       13500      USD                    USA            Freelance Consutant
## 99       13520      USD                    USA                     specialist
## 100      13560      USD                 Kuwait           Instructional coach 
## 101      13600      USD                    USA                         Server
## 102      13763      EUR                Germany Apprentice Mechanic for Constr
## 103      13888      GBP                     UK       Early Years Practitioner
## 104      14000      USD                    USA Special education teaching ass
## 105      14000      USD                    USA                         Author
## 106      14000      USD                    USA                       waitress
## 107      14000      USD                    USA     Talent Development Manager
## 108      14453      ZAR           South Africa                       Sysadmin
## 109      14500      USD                    USA                    Senior Tech
## 110      14503      CAD                 Canada   Freelance artist/illustrator
## 111      14512      CAD                 Canada                Research Fellow
## 112      14559      EUR                 Latvia                  hobby teacher
## 113      14572      GBP                     UK                  Administrator
## 114      14850      USD                    USA             Adjunct Instructor
## 115      14920      ZAR           South Africa                         Editor
## 116      15000      USD                    USA                         writer
## 117      15000      USD                    USA               Graduate Student
## 118      15000      USD                    USA                     Assistant 
## 119      15000      USD                    USA Adjunct professor and PhD stud
## 120      15000      USD                    USA    Graduate Teaching Assistant
## 121      15000      USD                    USA             Adjunct Instructor
## 122      15000      USD                    USA               Graduate Student
## 123      15000      USD                    USA             Assistant Director
## 124      15000      USD                    USA Graduate Assistant for Athleti
## 125      15000      USD                 dbfemf                          mn jj
## 126      15000      USD               Colombia                  Profesional I
## 127      15080      USD                    USA                        Cashier
## 128      15080      USD                    USA                      Line Cook
## 129      15100      USD                    USA             Operations Manager
## 130      15276      GBP                     UK                        Cleaner
## 131      15400      USD                    USA              Americorps Member
## 132      15461      EUR               Portugal      L2 Cyber Security Analyst
## 133      15600      USD                    USA        Studio/Touring Musician
## 134      15758      EUR                Germany                         Dancer
## 135      16000      USD                    USA                 Student Worker
## 136      16000      USD                    USA                       Director
## 137      16000      USD                    USA Education and Outreach Coordin
## 138      16000      USD                    USA                         Writer
## 139      16000      USD                    USA       Administrative Assistant
## 140      16000      USD                    USA                            CFO
## 141      16115      CAD                 Canada                  Photographer 
## 142      16200      USD                    USA       Senior Software Engineer
## 143      16200      USD                  Chile            Mathematics Teacher
## 144      16500      USD                    USA      Vice President, Analytics
## 145      16578      ZAR           South Africa System administration and proj
## 146      16640      USD                    USA           Recruiting Assistant
## 147      16800      USD                    USA                 Literacy Tutor
## 148      16902      EUR                 Greece    Administrative coordinator 
## 149      16970      EUR                  Spain                     Researcher
## 150      17000      USD                    USA             Research assistant
## 151      17000      USD                    USA Undergraduate Research Assista
## 152      17000      USD                    USA                      Secretary
## 153      17200      USD                    USA Social Studies and Technology 
## 154      17300      USD                    USA Project Coordinator/Communicat
## 155      17360      GBP                     UK                       Director
## 156      17384      EUR                  Spain            Frontend programmer
## 157      17472      USD                    USA              Assistant Manager
## 158      17500      USD                    USA    Graduate Research Assistant
## 159      17505      EUR                  Spain       Administrative assistant
## 160      17680      USD                    USA      Special Events Assistant 
## 161      17747      EUR              Lithuania                      Assistant
## 162      17843      EUR                 France          High Ropes Instructor
## 163      18000      USD                    USA                Sales Associate
## 164      18000      USD                    USA         Circulation Supervisor
## 165      18000      USD                    USA              Graduate Student 
## 166      18000      USD                    USA                     cartoonist
## 167      18000      USD                    USA             Software Architect
## 168      18000      USD                    USA PhD student/teaching assistant
## 169      18000      USD                    USA               Office Associate
## 170      18000      USD                    USA     Graduate Research Assitant
## 171      18000      USD                    USA         Senior Director, Sales
## 172      18000      USD                    USA             Substitute Teacher
## 173      18000      USD                    USA                     Bus Driver
## 174      18021  AUD/NZD                    USA              Library Assistant
## 175      18034      EUR                 France Field technician in archaeolog
## 176      18054      GBP                     UK        Pupil Support Assistant
## 177      18109      EUR                     UK            Marketing assistant
## 178      18500      USD                    USA              Graduate Student 
## 179      18500      USD                    USA Consumer directed care attenda
## 180      18532      CAD                 Canada              Optical assistant
## 181      18543      CAD                 Canada                   Union clerk 
## 182      18566      EUR                  Italy                    PhD student
## 183      18720      USD                    USA     Cashier/back-end food prep
## 184      18720      USD                    USA                  Cashier/Clerk
## 185      18720      USD                    USA                   Page/Shelver
## 186      18720      USD                    USA                        Stocker
## 187      18720      USD                    USA                        Student
## 188      19000      USD                    USA              Teacher Assistant
## 189      19000      USD                    USA        Arts Education Director
## 190      19000      USD                    USA Special Education Paraprofessi
## 191      19065      ZAR           South Africa                   Data Analyst
## 192      19108      CAD                 Canada        Freelance Writer/Editor
## 193      19200      USD                    USA                Admin Assistant
## 194      19276      EUR                 France        Java Software Developer
## 195      19395      EUR                  Italy                       Post Doc
## 196      19500      USD                    USA                Museum Director
## 197      19502      EUR                  Spain             marketing designer
## 198      19583      USD                    USA            Executive Assistant
## 199      19760      USD                    USA                Order Processor
## 200      19885      USD                    USA                Sales Associate
## 201      19978      USD                    USA            Reference Librarian
## 202      19998      GBP                     UK            Pharmacy Technician
## 203      20000      USD                    USA               Graduate Student
## 204      20000      USD                    USA             Part-time Faculty 
## 205      20000      USD                    USA                     Cataloguer
## 206      20000      USD                    USA               Paraprofessional
## 207      20000      USD                    USA                     Dishwasher
## 208      20000      USD                    USA                       Cashier 
## 209      20000      USD                    USA          Social media producer
## 210      20000      USD                    USA                          Tutor
## 211      20000      USD                    USA                Unit secretary 
## 212      20000      USD                    USA                        Adjunct
## 213      20000      USD               Zimbabwe                        manager


We still notice many unexpectedly small values (as well as those $0 values). Some may be due to the weaker economies of those countries, however some appear to be human error.

Based on the minimum hourly wage of multiple countries, and some simple calculations, I decide to remove observations from a series of developed countries, that fall below a conservative salary of US$23,000.

We subsequently also remove a small set of salaries that fall below $11,000, that were originally in either JPY, SEK or ZAR, as these values do not seem realistic either.

salary_data2 <- salary_data2 %>%
  filter(salary_USD > 23000 | (salary_USD < 23000 & !(currency %in% c("USD", "GBP", "EUR", "AUD/NZD", "CAD"))))

salary_data2 <- salary_data2 %>%
  filter(salary_USD > 11000)


Rechecking salary_USD distribution after initial outlier treatment


Re-check the distribution by means of boxplot and histogram.

summary(salary_data2$salary_USD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14453   55000   75000   89394  108000 5000044
boxplot(salary_data2$salary_USD, main = "Distribution of salary_USD after intitial outlier removal")

hist(salary_data2$salary_USD, xlab = "salary_USD", main = "Distribution of salary_USD after intitial outlier removal", breaks = 500)

Extreme positive skew still present.

Observe histogram (zoomed into the lower section of y axis) to observe potential outliers still present.

breaks <- seq(min(salary_data2$salary_USD, na.rm = TRUE), max(salary_data2$salary_USD, na.rm = TRUE), length.out = 100)
ggplot(salary_data2, aes(x = salary_USD)) + 
  geom_histogram(breaks = breaks, fill = "#007bc2", col = "white") +
  coord_cartesian(ylim = c(0, 15)) +
  labs(title = "Distribution of salary_USD (zoomed into the lower section of y axis)")


Histograms of data split into two parts to aid visualization of the low frequency positive outliers.

#set new parameter to display graphs side by side
par(mfrow = c(1, 2))

#histogram of trimmed salary data (up to $300,000)
data = salary_data2$salary_USD
breaks = seq(min(data), 300000, length.out = 80)
subset = (data[data <= max(breaks)])
hist(subset, breaks = breaks, col = "#007bc2", xlab = "Salary (USD)", main = "Salary Distribution (from $23,000 to $300,000)", cex.main = 0.6)

#histogram of trimmed salary data (from 303000 to max)
breaks = seq(300000, max(data), length.out = 80)
subset = (data[data >= min(breaks)])
hist(subset, breaks = breaks, col = "#007bc2", xlab = "Salary (USD)", main = "Salary Distribution (from $300,000 to $5,000,044)", cex.main = 0.6)

Index plot of data ordered by salary_USD.

#first sort data by increasing salary_USD
salary_data2 %>%
  arrange(salary_USD) %>%
  ggplot(aes(x = row_number(salary_USD), y = salary_USD)) +
  geom_point() +
  labs(x = "Ordered Observations", y = "Salary (USD)")

There are five positive outliers, starting at approximately $1,600,000.

salary_USD is severely positively skewed, with what looks to be an exponentially increasing trend.

We can use a couple of different methods to gain more understanding of the extremity of the outliers.


Z-Score Method for Outlier Detection

Z-score method of identifying outliers: values further than 3 standard deviations from the mean.

#Convert values to z-scores and identify those with absolute value > 3
z_scores <- scale(salary_data2$salary_USD)
outliers_z_logical <- abs(z_scores) > 3

#Create a data-frame with the z-scores and associated USD salary values that are considered outliers
z_scores_outliers <- z_scores[outliers_z_logical]
salary_USD_outliers <- salary_data2$salary_USD[which(outliers_z_logical)]
outliers_z_df <- data.frame(salary_USD_outliers, z_scores_outliers)

#view dataframe (too long to print)
#outliers_z_df %>%
#  arrange(desc(salary_USD_outliers))

#identify the number of outliers
paste("number of outliers: ", length(z_scores_outliers))
## [1] "number of outliers:  176"
#identify the lowest value
paste("Lowest value: ", min(salary_USD_outliers))
## [1] "Lowest value:  302543"


According to this method, there are 176 outliers, all positive, some of which are massively beyond 3 standard deviations from the mean. Values greater than or equal to $302,543 are considered outliers.


IQR Method for Outlier Detection

Another approach, IQR method of identifying outliers: Points further than 1.5 * IQR from the median are considered outliers.

#obtain quantiles and IQR
Q1 <- quantile(salary_data2$salary_USD, 0.25)
Q3 <- quantile(salary_data2$salary_USD, 0.75)
IQR_value <- Q3 - Q1

#identify outliers using those values
outliers_iqr_logical <- (salary_data2$salary_USD < (Q1 - 1.5 * IQR_value)) | (salary_data2$salary_USD > (Q3 + 1.5 * IQR_value))

iqr_outliers <- salary_data2$salary_USD[which(outliers_iqr_logical)]

#View outliers, in ascending order (too large to print)
#sort(iqr_outliers, decreasing = FALSE)

#identify the number of outliers
paste("number of outliers: ", length(iqr_outliers))
## [1] "number of outliers:  1123"
#identify the lowest value
paste("Lowest value: ", min(iqr_outliers))
## [1] "Lowest value:  187798"


There are 1123 resulting outliers — all positive - using this method.

Values greater than or equal to $187,798 are considered outliers, which is a more aggressive approach than the Z-score method used previously.


Comparing Sample Data with Population Data

There are a lot of observations with extremely high salary values. The potential for bias is high, as this is an anonymous self-selected survey over the internet. There are multiple salary values over $1,000,000 and many more residing close to this value.

After doing some online research, we can compare the proportion of high earners in this survey with the proportion of high earners in the general population, to get a general gauge of how likely we are to observe data this extreme, and then make a judgement about how realistic these values may be.

In the following table, the ‘population’ column represents an estimation, based on online research, for the proportion of the population with a salary level equal or greater to the assigned value, and the ‘sample’ value represents the proportion within this survey.

#create a function that takes threshold values and data frame, and returns proportion of data at least that large
proportion_func <- function(threshold, data){
  data %>%
    filter(salary_USD >= threshold) %>%
    summarize(proportion = signif(n() / nrow(data), 2))
}

#establish thresholds and calculate proportions
thresholds <- c(300000, 500000, 700000, 1000000, 1250000, 1500000, 2000000, 2500000, 3000000, 4000000)
proportions = sapply(thresholds, proportion_func, data = salary_data2)
names(proportions) = thresholds
library(knitr)

# Assign table columns (vectors) then collate together
thresh_values <- c("$300k", "$500k", "$700k", "$1m", "$1.25m", "$1.5m", "$2m", "$2.5m", "$3m", "$4m")
pop_values <- c(0.01, 0.002, 0.001, 0.0009, 0.0008, 0.0007, 0.0005, 0.0002, 0.0001, 0.0001)
samp_values <- as.vector(unlist(proportions))

df <- data.frame(salary = thresh_values, population = pop_values, sample = samp_values)

# Set options to display numbers in absolute terms
options(scipen = 999)

# Print table
kable(df, caption = "Estimated Proprtions of Population and Survey Sample over Specific Salary Thresholds")
Estimated Proprtions of Population and Survey Sample over Specific Salary Thresholds
salary population sample
$300k 0.0100 0.007400
$500k 0.0020 0.002300
$700k 0.0010 0.001300
$1m 0.0009 0.000580
$1.25m 0.0008 0.000400
$1.5m 0.0007 0.000180
$2m 0.0005 0.000110
$2.5m 0.0002 0.000073
$3m 0.0001 0.000073
$4m 0.0001 0.000036
options(scipen = 0)

The table suggests that the survey data seems to be within reasonable expectation.


Transformation and Further Outlier Treatment


Logarithm transformation

Income based data often features a distinct positive skew.

Due to extreme positive skew in our salary_USD variable, we can observe the effect of a log transformation.

#log transformation of salary
salary_USD_log = log(salary_data2$salary_USD)
data = salary_USD_log

#histogram
par(mfrow = c(1,2))
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "Salary (USD)", main = "Log Transformed Salary")

#boxplot
boxplot(data, col = "skyblue", xlab = "Salary (USD)", main = "Log Transformed Salary")

A more symmetric distribution is obtained (although outliers are still present).


Combination of Logarithm Transformation and Various degrees of Outlier Trimming

Comparison of side-by-side boxplots and histograms allows us to assess the affect of varying degrees of trimming outliers.

First with the un-transformed salary_USD variable.

#Esatablish new data-frames with differing degrees of outlier trimming
salary_data_tr <- salary_data2 %>%
  filter(salary_USD < 3000000, salary > 30000)

salary_data_tr2 <- salary_data2 %>%
  filter(salary_USD < 1500000, salary > 30000)

salary_data_tr3 <- salary_data2 %>%
  filter(salary_USD < 1000000, salary > 30000)

salary_data_tr4 <- salary_data2 %>%
  filter(salary_USD < 300000, salary > 30000)

salary_data_tr5 <- salary_data2 %>%
  filter(salary_USD < 190000, salary > 30000)

#Comparative salary boxplots (original variable, with progressively more outliers trimmed)
par(mfrow = c(2,3))
boxplot(salary_data2$salary_USD, col = "lightblue", main = "No additional trimming")
boxplot(salary_data_tr$salary_USD, col = "lightblue", main = "salary < $3,000,000")
boxplot(salary_data_tr2$salary_USD, col = "lightblue", main = "salary < $1,500,000")
boxplot(salary_data_tr3$salary_USD, col = "lightblue", main = "salary < $1,000,000")
boxplot(salary_data_tr4$salary_USD, col = "lightblue", main = "salary < $300,000")
boxplot(salary_data_tr5$salary_USD, col = "lightblue", main = "salary < $190,000")

par(mfrow = c(2,3))

#Comparative salary histograms (original variable, with progressively more outliers trimmed)
data = salary_data2$salary_USD
breaks = seq(min(data), max(data), length.out = 500)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "No additional trimming")

data = salary_data_tr$salary_USD
breaks = seq(min(data), max(data), length.out = 400)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $3,000,000")

data = salary_data_tr2$salary_USD
breaks = seq(min(data), max(data), length.out = 300)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $1,500,000")

data = salary_data_tr3$salary_USD
breaks = seq(min(data), max(data), length.out = 100)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $1,000,000")

data = salary_data_tr4$salary_USD
breaks = seq(min(data), max(data), length.out = 60)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $300,000")

data = salary_data_tr5$salary_USD
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", xlab = "salary ($)", main = "salary < $190,000")


Then we compare with the log-transformed distribution, with varying degrees of outliers trimmed.

#log transform each trimmed salary vector
salary_USD_tr_log = log(salary_data_tr$salary_USD)
salary_USD_tr2_log = log(salary_data_tr2$salary_USD)
salary_USD_tr3_log = log(salary_data_tr3$salary_USD)
salary_USD_tr4_log = log(salary_data_tr4$salary_USD)
salary_USD_tr5_log = log(salary_data_tr5$salary_USD)

#plot boxplots
par(mfrow = c(2,3))
boxplot(salary_USD_log, col = "lightblue", main = "log(salary)")
boxplot(salary_USD_tr_log, col = "lightblue", main = "log(salary < $3,000,000)")
boxplot(salary_USD_tr2_log, col = "lightblue", main = "log(salary < $1,500,000)")
boxplot(salary_USD_tr3_log, col = "lightblue", main = "log(salary < $1,000,000)")
boxplot(salary_USD_tr4_log, col = "lightblue", main = "log(salary < $300,000)")
boxplot(salary_USD_tr5_log, col = "lightblue", main = "log(salary < $190,000)")

par(mfrow = c(2,3))

data = salary_USD_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary)")

data = salary_USD_tr_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $3,000,000)")

data = salary_USD_tr2_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $1,500,000)")

data = salary_USD_tr3_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $1,000,000)")

data = salary_USD_tr4_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $300,000)")

data = salary_USD_tr5_log
breaks = seq(min(data), max(data), length.out = 30)
hist(data, breaks = breaks, col = "#007bc2", border = "white", xlab = "salary ($)", main = "log(salary < $190,000)")


Using the log transformed distribution with observations over $300,000 trimmed off, we can again determine potential outliers using the previous Z-score and IQR methods:

#Z-scores method
#Convert values to z-scores and identify those with absolute value > 3
z_scores2 <- scale(salary_USD_tr4_log)
outliers_z_logical2 <- abs(z_scores2) > 3

#Create a data-frame with the z-scores and associated USD salary values that are considered outliers
z_scores_outliers2 <- z_scores2[outliers_z_logical2]
salary_USD_log_outliers2 <- salary_USD_tr4_log[which(outliers_z_logical2)]
salary_USD_outliers2 <- salary_data_tr4$salary_USD[which(outliers_z_logical2)]
outliers_z_df2 <- data.frame(salary_USD_outliers2, salary_USD_log_outliers2, z_scores_outliers2)

#view dataframe
outliers_z_df2 %>%
  arrange(desc(salary_USD_outliers2))
##   salary_USD_outliers2 salary_USD_log_outliers2 z_scores_outliers2
## 1                19065                 9.855609          -3.123233
## 2                16578                 9.715832          -3.430592
## 3                14920                 9.610458          -3.662300
## 4                14453                 9.578657          -3.732227
#obtain quantiles and IQR
Q1 <- quantile(salary_USD_tr4_log, 0.25)
Q3 <- quantile(salary_USD_tr4_log, 0.75)
IQR_value <- Q3 - Q1

#identify outliers using those values
outliers_iqr_logical <- (salary_USD_tr4_log < (Q1 - 1.5 * IQR_value)) | (salary_USD_tr4_log > (Q3 + 1.5 * IQR_value))

#Create a data-frame with the outliers for both log transformed and non-transformed values
iqr_outliers_log_2 <- salary_USD_tr4_log[which(outliers_iqr_logical)]
iqr_outliers_2 <- salary_data_tr4$salary_USD[which(outliers_iqr_logical)]
iqr_outliers_df2 <- data.frame(iqr_outliers_2, iqr_outliers_log_2)

#view dataframe
iqr_outliers_df2 %>%
  arrange(desc(iqr_outliers_2))
##    iqr_outliers_2 iqr_outliers_log_2
## 1          295000          12.594731
## 2          295000          12.594731
## 3          295000          12.594731
## 4          295000          12.594731
## 5          292000          12.584509
## 6          290000          12.577636
## 7           19065           9.855609
## 8           16578           9.715832
## 9           14920           9.610458
## 10          14453           9.578657


Results


By observing the boxplots and histograms for the regular salary_USD variable, it is clear that there is a very high degree of positive skew, with a high frequency of positive outliers. Each degree of trimming outliers tends to progressively increase normality.

The distribution with values exceeding $300,000 trimmed off (salary_data_tr4) appears to display an improvement to normality and reduced outlier frequency while offering a bit more constraint versus the more aggressive approach of trimming values exceeding $190,000. The Z-score method of outlier detection supports this approach.

This distribution results in the following summary statistics, and density plot:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14453   56000   76000   87763  108000  295000
count average std_err median std_dev
26342 87763 266 76000 43150


The distribution appears relatively normal with significant positive skew. There is a very large range with the smallest value of $14,453, equating to just under 5% of the largest value, $295,000. In comparison to the median salary value, $76,000, the mean ($87,763) is being pulled in a positive direction due to the positive skew present. The IQR also reflects this.

Income-based data often exhibits a positive skew for several reasons:

  • Income Inequality: Many societies have significant income inequality, with a relatively small number of individuals or households earning very high incomes, while the majority earn lower incomes. This inequality causes the income distribution to be positively skewed, with a long right tail.

  • Capped Lower Bound: In many cases, there is a minimum income threshold below which individuals cannot earn less. This creates a lower bound or floor effect, resulting in a concentration of observations at the lower end of the income scale, leading to positive skewness.

  • Diminishing Returns: As income increases, the marginal utility of additional income tends to decrease. This means that while some individuals may earn very high incomes, the rate at which their income grows may slow down, contributing to a positively skewed distribution.

  • Outliers: Incomes at the very high end of the scale, such as those of top executives, celebrities, or entrepreneurs, can be extreme outliers and contribute to the positive skew of the distribution.

Therefore, a log-transformation of salary_USD seems like a very suitable measure to reduce skew and aid the assumptions required for upcoming inferential testing.

We see that the log-transformation does in-fact seem to significantly improve normality.

Again, the distribution with no values exceeding $300,000, salary_USD_tr4_log, seems to display a healthy balance of trimmed outliers:.

data <- data.frame(salary = salary_USD_tr4_log)
ggplot(data, aes(x = salary)) + 
  geom_density(alpha = 0.4, linewidth = 0.1, fill = "blue", color = "blue") +
  labs(title = bquote(bold("log(Salary < $300,000)")), x = "log(Salary (USD))", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16))


The plots indicate that we only have a few potential outliers remaining in this distribution.

Using the Z-scores method, there are only four observations that meet the criteria of being outliers (this is significantly less than 134 outliers - the number of outliers remaining with the previous iteration of outlier trimming (< $1,000,000).

Using the IQR method, the same four observations meet the criteria of being negative outliers, and we also have six positive outliers (again, this is significantly less than 164 outliers - the number of outliers remaining with the previous iteration of outlier trimming (< $1,000,000).

Therefore, we may consider this distribution (salary_USD_tr4_log) when proceeding with some of our analysis methods that require normality and homoscedasticity.


Factor Variables


Exploratory Analysis


Bar graphs displaying distributions of factor variables.

# Calculate counts for each level then create a bar graph of the counts

education_counts <- table(salary_data_tr4$education)
ggplot(data = NULL, aes(x = factor(names(education_counts), 
                                   levels = levels(salary_data_tr4$education)), y = education_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = education_counts), 
             vjust = ifelse(education_counts > 0.90 * max(education_counts), 1.5, -0.5), size = 3) +
  theme_classic() +
  labs(x = "Education", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Education"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7), axis.title.x = element_text(margin = margin(t = 10), hjust = 0.5)) +
  annotate("rect", xmin = 5.9, xmax = 7.1, ymin = 11700, ymax = 12300,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 6.5, y = 12000, label = "Box value = count", size = 3) 

work_exp_overall_counts <- table(salary_data_tr4$work_exp_overall)
ggplot(data = NULL, aes(x = factor(names(work_exp_overall_counts), 
                                   levels = levels(salary_data_tr4$work_exp_overall)), y =work_exp_overall_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = work_exp_overall_counts), 
             vjust = ifelse(work_exp_overall_counts > 0.90 * max(work_exp_overall_counts), 1.5, -0.5), size = 3) +
  theme_classic() +
  labs(x = "Work Experience (Overall)", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Overall Work Experience"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 6.7, xmax = 8.3, ymin = 9700, ymax = 10300,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 7.5, y = 10000, label = "Box value = count", size = 3) 

work_exp_field_counts <- table(salary_data_tr4$work_exp_field)
ggplot(data = NULL, aes(x = factor(names(work_exp_field_counts), 
                                   levels = levels(salary_data_tr4$work_exp_field)), y = work_exp_field_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = work_exp_field_counts), 
             vjust = ifelse(work_exp_field_counts > 0.90 * max(work_exp_field_counts), 1.5, -0.5), size = 3) +
  theme_classic() +
  labs(x = "Work Experience (in field)", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Work Experience in Field"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 6.8, xmax = 8.2, ymin = 5800, ymax = 6200,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 7.5, y = 6000, label = "Box value = count", size = 3) 

age_group_counts <- table(salary_data_tr4$age_group)
ggplot(data = NULL, aes(x = factor(names(age_group_counts), 
                                   levels = levels(salary_data_tr4$age_group)), y = age_group_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = age_group_counts), 
             vjust = ifelse(age_group_counts > 0.90 * max(age_group_counts), 1.5, -0.5), size = 3) +
  theme_classic() +
  labs(x = "Age", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Age Group"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7.5), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 5.9, xmax = 7.1, ymin = 11700, ymax = 12300,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 6.5, y = 12000, label = "Box value = count", size = 3) 

gender_counts <- table(salary_data_tr4$gender)
ggplot(data = NULL, aes(x = factor(names(gender_counts), levels = c("Man", "Woman", "Non-binary", "Other or prefer not to answer")), y = gender_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = gender_counts), vjust = ifelse(gender_counts > 0.40 * max(gender_counts), 1.5, -0.5), size = 3) +  # Label the bars with counts
  theme_classic() +
  labs(x = "Gender", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Gender"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 3.6, xmax = 4.4, ymin = 18500, ymax = 19500,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 4, y = 19000, label = "Box value = count", size = 3) 

currency_counts <- table(salary_data_tr4$currency)
ggplot(data = NULL, aes(x = names(currency_counts), y = currency_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = currency_counts), vjust = ifelse(currency_counts > 0.40 * max(currency_counts), 1.5, -0.5), size = 3) +  # Label the bars with counts
  theme_classic() +
  labs(x = "Currency", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Currency"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 1, xmax = 3, ymin = 20500, ymax = 21500,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 2, y = 21000, label = "Box value = count", size = 3) 

region_counts <- table(salary_data_tr4$region)
threshold <- 0.20 * max(region_counts)
ggplot(data = NULL, aes(x = names(region_counts), y = region_counts)) +
  geom_col(fill = "#007bc2") +
  geom_label(aes(label = region_counts, 
                 y = ifelse(region_counts > threshold, region_counts - threshold * 0.8, region_counts)), 
                 vjust = -0.5, size = 2.5, color = "black") + 
    theme_classic() +
  labs(x = "Region", y = "Count", title = bquote("Number of " ~ bold("Respondents") ~ " by " ~ bold("Region"))) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(margin = margin(t = 10))) +
  annotate("rect", xmin = 10, xmax = 12, ymin = 21000, ymax = 23000,
           fill = "white", color = "black", linetype = "solid", linewidth = 0.2) +
  annotate("text", x = 11, y = 22000, label = "Box value = count", size = 3) 

# Display plots side by side
#library(gridExtra)
#grid.arrange(plot_gender, plot_currency, ncol = 2, widths = c(0.8, 0.8))


Contingency tables for factor variables with many levels:

industry_counts <- salary_data_tr4 %>%
  group_by(industry) %>%
  count() %>%
  filter(n > 500) %>%
  arrange(desc(n))

kable(industry_counts, caption = "Distribution of factor: industry (n > 500 only)")
Distribution of factor: industry (n > 500 only)
industry n
Computing or Tech 4477
Education (Higher Education) 2313
Nonprofits 2308
Government and Public Administration 1808
Health care 1785
Engineering or Manufacturing 1709
Accounting, Banking & Finance 1703
Marketing, Advertising & PR 1064
Law 1025
Other 861
Business or Consulting 812
Education (Primary/Secondary) 754
Media & Digital 734
Insurance 505
city_counts <- salary_data_tr4 %>%
  group_by(city) %>%
  count() %>%
  filter(n > 200) %>%
  arrange(desc(n))

kable(city_counts, caption = "Distribution of factor: city (n > 200 only)")
Distribution of factor: city (n > 200 only)
city n
Boston 759
Chicago 727
New York 684
Seattle 666
London 500
New York City 492
San Francisco 478
Los Angeles 447
Portland 414
Toronto 407
Austin 336
Washington 330
Denver 311
Atlanta 307
Minneapolis 306
Philadelphia 261
Houston 254
Washington, DC 246
Dallas 221
NYC 212
Vancouver 203
us_state_counts <- salary_data_tr4 %>%
  group_by(us_state) %>%
  count() %>%
  filter(n > 200) %>%
  arrange(desc(n))

kable(us_state_counts, caption = "Distribution of factor: us_state (n > 200 only)")
Distribution of factor: us_state (n > 200 only)
us_state n
4206
California 2522
New York 2104
Massachusetts 1478
Texas 1216
Illinois 1169
Washington 1143
District of Columbia 954
Pennsylvania 902
Virginia 752
Minnesota 709
Ohio 628
Colorado 615
Oregon 607
North Carolina 577
Maryland 547
Michigan 521
Georgia 510
Florida 494
Wisconsin 455
New Jersey 385
Missouri 324
Indiana 308
Arizona 294
Tennessee 273
Connecticut 223
race_main_counts <- salary_data_tr4 %>%
  group_by(race_main) %>%
  count() %>%
  arrange(desc(n))

kable(race_main_counts, caption = "Distribution of factor: race_main")
Distribution of factor: race_main
race_main n
White 21877
Asian or Asian American 1307
Another option not listed here or prefer not to answer 711
Black or African American 630
Hispanic, Latino, or Spanish origin 553
Hispanic, Latino, or Spanish origin, White 362
Asian or Asian American, White 331
Other 158
Black or African American, White 113
Middle Eastern or Northern African, White 73
Native American or Alaska Native, White 65
Middle Eastern or Northern African 64
White, Another option not listed here or prefer not to answer 59
Native American or Alaska Native 39


Results


Education, work experience and age appear to be somewhat normally distributed.

The majority of respondents have a college degree as their highest form of education. A Master’s degree is also fairly popular, while the remainder of options are all much less frequent.

For overall work experience, note that the band widths for each category are not equal, affecting our ability to make fair comparisons. Work experience of 11 - 20 years is the most frequent option, with 8 - 10 years experience coming in second, however, it has a much smaller bandwidth. Frequency progressively decreases towards the tails. As expected, very few respondents have worked for 41 or more years.

Work experience in one’s particular field is more evenly distributed over the categories ranging from 2 years to 20 years. Again, frequency decreases towards the tails.

When observing the age group graph, we again notice some inconsistencies between the bandwidths of categories. The 25 - 34 category is the most frequent, with 35 - 44 a bit less so and then a fairly large decrease to the next greatest, 55 - 64.

Interestingly, the gender distribution is significantly weighted towards Women, with approximately four times more respondents than men.

USD is the most common currency by a significant margin. Canadian dollar, Great Britain pound, Euro and Australia/NZ dollar are the next most common, with the remainder of currencies being very infrequent.

Region is almost completely dominated by Northern America, with the next most common being Northern Europe.

In industry, computing or technology is considerably more frequent than the rest. White is by far the most frequent race response.




Bivariate Analysis


Salary by Region


We are interested in exploring whether there is any correlation between a person’s global region and annual salaries. Initially we can perform an exploratory analysis of the data to look for patterns and trends. Following this, we will perform some inferential tests to gain more statistical insight.


Exploratory Analysis


Salary by region statistics (ordered by median salary)

#Alternative method:
#library(psych)
#region_summaries <- describeBy(salary_USD ~ region, data = salary_data_tr4, mat = TRUE, digits = 2)

#create a summary statistics table for region factor
region_summaries <- salary_data_tr4 %>%
  group_by(region) %>%
  summarize(count = n(),
            average = round(mean(salary_USD)),
            std_err = round(sd(salary_USD) / sqrt(count)),
            median = round(median(salary_USD)),
            std_err_median = round(((quantile(salary_USD, 0.75) - quantile(salary_USD, 0.25)) / sqrt(n())) * 1.253),
            var = round(var(salary_USD)),
            std_dev = round(sd(salary_USD))
            ) %>%
  arrange(desc(median))

#Produce a nicely formatted table
kable(region_summaries)
region count average std_err median std_err_median var std_dev
Western Asia 13 101430 15365 95000 10773 3068961377 55398
Eastern Europe 8 97225 16153 79317 25041 2087459753 45689
Western Europe 375 86636 1907 78512 2707 1363248486 36922
Central America, South America & Caribbean 21 75186 7244 78000 9843 1102010286 33197
Northern America 24278 88496 279 77000 426 1895642514 43539
NA 71 81929 4660 75000 8108 1541693186 39264
Southern Asia 7 83538 15653 70000 20337 1715015103 41413
Northern Europe 1328 77730 993 66615 1307 1310439000 36200
Australia and New Zealand 127 72741 3064 65007 4021 1192658455 34535
Africa 27 67507 6917 65000 10784 1291894626 35943
Southern Europe 41 77216 7043 63032 8041 2033738776 45097
South-Eastern Asia 12 85667 17201 53750 22110 3550560606 59587
Eastern Asia 34 67964 7688 49040 8413 2009455837 44827


Median salary per region compared to global median.

global_median = median(salary_data_tr4$salary_USD)

salary_data_tr4 %>%
  group_by(region) %>%
  summarise(med = median(salary_USD)) %>%
  filter(region != 'NA') %>%
  ggplot(aes(x = med, y = reorder(region, med), color = med)) +
  geom_point(size = 4) +
    geom_segment(aes(xend = 40000, yend = region), linewidth = 2) +
    geom_text(aes(label = round(med / 1000)), color = "white", size = 2.5) +
    scale_x_continuous("", expand = c(0,0), limits = c(40000,100000), position = "top", labels = scales::comma_format(scale = 1e-3)) +
    scale_color_gradient(low = "#F08080", high = "#007bc2", guide = "legend") +
    theme_classic() +
    theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(margin = margin(t = 10, b = 10)),
        plot.margin = margin(l = 10, r = 20)
        ) +
    labs(title = bquote(bold("Median Salary ") ~ "(in US$1000) per " ~ bold("Region"))) +
    geom_vline(xintercept = global_median, color = "grey40", linetype = 3) +
    annotate("text", x = global_median + 5000, y = 5.5, label = "The\nglobal\nmedian", 
             vjust = 1, size = 3, color = "grey40") +
    annotate("curve", x = global_median + 5000, y = 5.6, xend = global_median + 1000, yend = 7,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"), color = "grey40")


Median salary for each region with associated standard error bars and count values, ranked in descending order of median value

#filter out NA values first
salary_data_tr4_noNA <- salary_data_tr4 %>%
  filter(!is.na(region))

# Calculate the median salary and associated SE estimates for each region
median_salaries <- salary_data_tr4_noNA %>%
  group_by(region) %>%
  summarise(median_salary = median(salary_USD),
            se_med_salary = ((quantile(salary_USD, 0.75) - quantile(salary_USD, 0.25)) / sqrt(n())) * 1.253,
            count = n())

#plot bar graphs
ggplot(median_salaries, aes(x = reorder(region, -median_salary), y = median_salary)) +
  geom_col(fill = "#007bc2") +
  geom_errorbar(aes(ymin = median_salary - se_med_salary, ymax = median_salary + se_med_salary), width = 0.1) +
  geom_label(aes(x = region, y = 7000, label = count), size = 2.5) +
  theme_classic() +
  labs(x = "Region", y = "Median Salary (USD)", title = bquote(bold("Median Salary ") ~ "per " ~ bold("Region") ~ " (with SE bars)")) +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10), plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5))


Side by side boxplots of salary by region (ordered by median salary)

ggplot(salary_data_tr4_noNA, aes(y = salary_USD, x = reorder(region, -salary_USD, FUN = median))) +
  geom_boxplot(fill = "blue", color = "black", width = 0.8, alpha = 0.2) +
  labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Region", y = "Salary (USD)") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1), axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10), plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5, size = 16))


Side by side dot plots of salary by region (ordered by median salary)

ggplot(salary_data_tr4_noNA, aes(x = salary_USD, y = reorder(region, salary_USD, FUN = median))) + 
  geom_point(alpha = 0.3, fill = "blue", colour = "blue") +
  labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Salary (USD)", y = "Region") +
  theme(plot.title = element_text(margin = margin(t = 10, b = 10), size = 16))


Side-by-side violin plots of salary by region

ggplot(data = salary_data_tr4_noNA, aes(x = reorder(region, salary_USD, FUN = median), y = salary_USD, fill = region)) +
  geom_violin(alpha = .25, color = "grey") +
  #optional points superimposed on plots
  #geom_jitter(alpha=.10, aes(color=region), position=position_jitter(width=0.3)) +
  #geom_label(aes(x = 0, y = region, label = count), size = 2.5) +
  labs(title = bquote(bold("Salary ") ~ "by " ~ bold("Region")), x = "Region", y = "Salary (USD)") +
  #scale_fill_manual(values=cbPalette) +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  theme_classic() +
  theme(plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5, size = 16), legend.position = "none")


Overlapping density plots for salary by region.

ggplot(salary_data_tr4_noNA, aes(x = salary_USD, fill = region, color = region)) + 
  geom_density(alpha = 0.3, linewidth = 0.1) +
  labs(title = bquote(bold("Salary") ~ " by" ~ bold("Region")), x = "Salary (USD)", y = "Density") +
  theme_classic() +
  theme(legend.title = element_blank(), legend.position = "top", legend.text = element_text(size = 8),
    legend.key.width = unit(0.5, "cm"), legend.key.height = unit(0.5, "cm"), legend.background = element_rect(fill = "grey90", color = "grey"), plot.title = element_text(hjust = 0.5, size = 16))


Comparison density plots of salary by region.

ggplot(salary_data_tr4_noNA, aes(x = salary_USD, fill = region, color = region)) + 
  geom_density(alpha = 0.4, linewidth = 0.1) +
  geom_vline(data = median_salaries, aes(xintercept = median_salary), color = "red", linetype = "dashed") +
  facet_wrap(~ region, ncol = 3) + # Separate plots for each level of 'region'
  labs(title = bquote(bold("Salary ") ~ "by" ~ bold("Region") ~ "(with median indicators)"), x = "Salary (USD)", y = "Density") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.3, size = 14))


Side by side dot plots of salary by region for top earners only (> $400,000):

salary_data2 %>%
  filter(salary_USD > 400000) %>%
  ggplot(aes(x = salary_USD, y = reorder(region, salary_USD, FUN = max))) + 
   geom_point(alpha = 0.3, fill = "blue", colour = "blue") +
    labs(title = bquote(bold("Salary ") ~ "(USD) by " ~ bold("Region") ~ " (top earners only)"), x = "Salary (USD)", y = "Region") +
    theme(plot.title = element_text(margin = margin(t = 10, b = 10), hjust = 0.5))


Inferential Testing


Initial ANOVA

Investigate the significance of differences in mean salary values between regions with an initial ANOVA test.

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

anova_region <- aov(salary_USD ~ region, data = salary_data_tr4_noNA)
summary(anova_region)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## region         11 2.114e+11 1.922e+10   10.36 <2e-16 ***
## Residuals   26259 4.872e+13 1.856e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Effect Size Calculation:

# Extracting sums of squares
SS_Between <- anova(anova_region)$Sum[1]
SS_Within <- anova(anova_region)$Sum[2]
SS_Total <- SS_Between + SS_Within

# Calculating effect size measures
eta_squared <- SS_Between / SS_Total

# Displaying effect size measures
cat("Eta-squared:", eta_squared)
## Eta-squared: 0.004321007

Statistically significant result (p-value < 0.000001 (6dp)). However, the exploratory analysis suggested violations of the ANOVA assumptions of normality and homoscedasticity.

Exploration required.


Diagnostics

Perform some in-depth diagnostics to check whether the assumptions for ANOVA are met.

plot(anova_region)


A boxplot of the residuals

ggplot() +
  geom_boxplot(aes(y = anova_region$res), fill = "skyblue") +
  labs(title = "Boxplot of inital ANOVA model residuals", y = "salary (USD)") +
  theme_minimal()


The Q-Q residuals plot suggests a great deal of positive skew and therefore violation of normality. The box plot of residuals reinforces this trend.

The Residuals vs Fitted plot affirms this with the positive residuals having more spread than the negative residuals. The plot also indicates that there are some differences in variance of residuals across groups. In particular, Eastern Europe, Southern Asia and Central/South America/Caribbeans seem to have particularly less variance than Northern America, Northern Europe and Western Europe (note there is likely a sample size effect present as the latter groups have the greatest number of observations in the dataset, while the former groups have the least).

In-fact, it is very likely that the statistical power of the ANOVA is being severely affected by the weighting of the largest group, Northern America. When observing the exploratory graph below, we notice that the global mean is almost identical to the Northern America mean, reinforcing this region’s overall influence due to dominating the sample size.

It might be necessary to remove those regions with very small sample sizes as this inconsistency in sample sizes interrupts the viability of our results.

global_mean = mean(salary_data_tr4_noNA$salary_USD)

salary_data_tr4 %>%
  group_by(region) %>%
  summarise(avg = mean(salary_USD)) %>%
  filter(region != 'NA') %>%
  ggplot(aes(x = avg, y = reorder(region, avg), color = avg)) +
    geom_point(size = 4) +
    geom_segment(aes(xend = 50000, yend = region), linewidth = 2) +
    geom_text(aes(label = round(avg / 1000)), color = "white", size = 2.5) +
    scale_x_continuous("", expand = c(0,0), limits = c(50000,105000), position = "top") +
    scale_color_gradient(low = "#F08080", high = "skyblue", guide = "legend") +
    theme_classic() +
    theme(axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text = element_text(color = "black"),
        axis.title = element_blank(),
        legend.position = "none",
        plot.title = element_text(margin = margin(t = 10, b = 10)),
        plot.margin = margin(l = 10, r = 20)
        ) +
    labs(title = bquote(bold("Average Salary ") ~ " (in US$1000) per " ~ bold("Region"))) +
    geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
    annotate("text", x = global_mean + 5000, y = 5.5, label = "The\nglobal\naverage", 
             vjust = 1, size = 3, color = "grey40") +
    annotate("curve", x = global_mean + 5000, y = 5.6, xend = global_mean + 1000, yend = 7,
    arrow = arrow(length = unit(0.2, "cm"), type = "closed"), color = "grey40")


Normality Tests

Our sample size is too large to perform some of the popular normality tests, e.g. Shapiro-Wilk. We can attempt some others:

#Kolmogorov-Smirinov Test
ks.test(salary_data_tr4_noNA$salary_USD, "pnorm", mean = mean(salary_data_tr4_noNA$salary_USD), sd = sd(salary_data_tr4_noNA$salary_USD))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  salary_data_tr4_noNA$salary_USD
## D = 0.1167, p-value < 2.2e-16
## alternative hypothesis: two-sided
#Jarque-Bera test for normality
library(tseries)
jarque.bera.test(salary_data_tr4_noNA$salary_USD)
## 
##  Jarque Bera Test
## 
## data:  salary_data_tr4_noNA$salary_USD
## X-squared = 12962, df = 2, p-value < 2.2e-16


Both the Kolmogorov-Smirinov test (D = 0.1167, p-value < 2.2e-16) and the Jarque-Bera test (X-squared = 12962, p-value < 2.2e-16) results are statistically significant, indicating we reject the null hypothesis that we have a normally distributed variable. Keep in mind, with a very large sample size (n = 26271), we generally have increased statistical power, meaning we detect very small deviations from normality with greater precision, even if the effect size is of very little practical importance.


Equal Variances Tests

Rule of thumb check for equal variances by calculating the ratio between the largest variance (South Eastern Asia: 3,550,560,606) and the smallest variance (Central/South America/Caribbean: 1,102,010,286)

F_stat = max(region_summaries$var) / min(region_summaries$var)
F_stat
## [1] 3.221894


The resulting value is slightly under the rule of thumb threshold of 4, indicating variances may be similar enough to satisfy the assumptions of ANOVA.

We can conduct some more general tests for homoscedasticity:

Hartley’s Test (F-test ratio check of variance).

index_1 <- which(salary_data_tr4$region == "South-Eastern Asia")
index_2 <- which(salary_data_tr4$region == "Central America, South America & Caribbean")
var.test(salary_data_tr4$salary_USD[index_1], salary_data_tr4$salary_USD[index_2])
## 
##  F test to compare two variances
## 
## data:  salary_data_tr4$salary_USD[index_1] and salary_data_tr4$salary_USD[index_2]
## F = 3.2219, num df = 11, denom df = 20, p-value = 0.02237
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   1.184145 10.394297
## sample estimates:
## ratio of variances 
##           3.221894


Bartlett’s Test

bartlett.test(salary_USD ~ region, data=salary_data_tr4_noNA)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  salary_USD by region
## Bartlett's K-squared = 110.8, df = 11, p-value < 2.2e-16


Levene Test

leveneTest(salary_USD ~ as.factor(region), data = salary_data_tr4_noNA)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group    11  8.6269 1.952e-15 ***
##       26259                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Hartley’s test returns a statistically significant result (F = 3.222 (3dp), num df = 11, denom df = 20, p-value = 0.02237). At the 5% level, we can reject the null hypothesis that the true ratio of variances is equal to 1. Note, there is a very large 95% CI for the F-stat ratio, as we have relatively small sample sizes for the groups concerned in the test. The CI almost contains a ratio value of 1.

Bartlett’s test may be more appropriate as we have more than two groups. We get a statistically significant result (K-squared = 110.8, df = 11, p-value < 2.2e-16), suggesting to reject the assumption of equal variances. However, this test is very sensitive to the normality assumption being satisfied.

Levene’s test, a special case of the Brown-Forsythe test, is less sensitive to the normality assumption than Bartlett and Hartley’s test. This is perhaps a more suitable choice. Again, high statistical significance (F = 8.6269, df = 11, p-value = 1.952e-15) suggesting unequal variances.


Conclusion

We have enough evidence to be concerned about both the assumptions of normality and homoscedasticty being violated.

Although, due to the very large sample size, the degree of violation to normality can be somewhat overlooked, we discovered earlier, a log transformation of salary_USD will significantly improve normality and homoscedasticty and is worth proceeding with.

Additionally, to improve the overall equality of sample sizes, and increase statistical power, we will remove some of the smaller region groups.


Sensitivity Analysis

We will perform a sensitivity analysis by initially applying a standard ANOVA model to the transformed dataset, and then iterating through various alternative models as required.

First, transform and mutate the data.

#Update trimmed dataframe with the log-transformed salary_USD variable and rename
salary_data_tr4 <- cbind(salary_data_tr4, salary_USD_tr4_log)
names(salary_data_tr4)[ncol(salary_data_tr4)] = "salary_USD_log"

#extract vector of names of regions to remove
names_reg <- salary_data_tr4 %>%
  group_by(region) %>%
  count() %>%
  filter(n < 30) %>%
  pull(region) %>%
  as.character()

#remove small region groups and remove NA group
salary_data_tr4_2 <- salary_data_tr4 %>%
  filter(!is.na(region) & !region %in% names_reg)


Run the ANOVA model on transformed data (and a corresponding linear model for more insight):

Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.

#ANOVA test output
anova_region2 <- aov(salary_USD_log ~ region, data = salary_data_tr4_2)
summary(anova_region2)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## region          5     22   4.374   21.28 <2e-16 ***
## Residuals   26177   5382   0.206                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#corresponding linear model output
lm_region2 <- lm(salary_USD_log ~ region, data = salary_data_tr4_2)
summary(lm_region2)
## 
## Call:
## lm(formula = salary_USD_log ~ region, data = salary_data_tr4_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23510 -0.33249 -0.03174  0.31581  1.35648 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            11.10585    0.04023 276.031  < 2e-16 ***
## regionEastern Asia     -0.13629    0.08755  -1.557   0.1196    
## regionNorthern America  0.17745    0.04034   4.399 1.09e-05 ***
## regionNorthern Europe   0.07216    0.04211   1.713   0.0866 .  
## regionSouthern Europe   0.02347    0.08144   0.288   0.7732    
## regionWestern Europe    0.18510    0.04655   3.976 7.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4534 on 26177 degrees of freedom
## Multiple R-squared:  0.004048,   Adjusted R-squared:  0.003857 
## F-statistic: 21.28 on 5 and 26177 DF,  p-value: < 2.2e-16


Effect Size Calculation:

# Extracting sums of squares
SS_Between <- anova(anova_region2)$Sum[1]
SS_Within <- anova(anova_region2)$Sum[2]
SS_Total <- SS_Between + SS_Within

# Calculating effect size measures
eta_squared <- SS_Between / SS_Total

# Displaying effect size measures
cat("Eta-squared:", eta_squared)
## Eta-squared: 0.0040477


Alternatives to standard ANOVA

We can attempt some other alternatives as a point of comparison.


Unequal Variances (Welch’s) ANOVA:

Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.

anova_region_welch <- oneway.test(salary_USD_log ~ region, data = salary_data_tr4_2)
anova_region_welch
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  salary_USD_log and region
## F = 25.36, num df = 5.00, denom df = 176.11, p-value < 2.2e-16


Non-parametric Kruskal-Wallis test:

Null Hypothesis: There is no significant difference in median logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in median logarithm of salaries across at least one pair of regions.

ggbetweenstats(data = salary_data_tr4_2, x = region, y = salary_USD_log, 
               type = "nonparametric", outlier.tagging = TRUE) + 
  labs(y = "log(salary (USD))", title = "Distributions of log(salary (USD)) by region with Kruskal-Walis and Pairwise tests")


Non-parametric Random Permutation Test (using un-transformed data):

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

#First assign the F-value from the previous regular anova model to a new object
anova_f <- summary(anova_region2)[[1]][1,4]

#establish the count for each region group (without NAs)
region_counts <- salary_data_tr4_2 %>%
  group_by(region) %>%
  summarise(n = n()) 
  
#assign a permutation function (data = data frame, n.vec = vector where each element represents the group sample sizes, R = number of resamples)
perm_fun <- function(data, R = 1000){
 
   #assign initial vectors and parameters
  mean_overall <- mean(data[,'salary_USD'])
  f_vec = numeric(R)
  n_vec = as.numeric(region_counts$n)
  n <- sum(n_vec)
 
   # Create names using paste0 and letters then assign names to the vector
  names_vec <- paste0("n", letters[seq_along(n_vec)])
  names(n_vec) <- names_vec
  
  #loop for each re-sample iteration
  for(r in 1:R){
    
    #assign new sum of squares within and between vectors
    ss_w = numeric(length(n_vec))
    ss_b = numeric(length(n_vec))
    
    #loop for each sample subsample of region
    for(k in 1:length(n_vec)){
      
      #draw random values from collated dataset
      index <- sample(1:n, n_vec[k])
      values <- data[index,'salary_USD']
      
      #cacl. the mean, and sum of squares for that random sample
      mean_group <- mean(values)
      ss_w[k] <- sum((values - mean_group)^2)
      ss_b[k] <- (mean_group - mean_overall)^2 * n_vec[k]
    }
 
  #calc. the mean ss and then f statistic for that complete permutation
  mss_w <- sum(ss_w) / (n - 1)
  mss_b <- sum(ss_b) / (length(n_vec) - 1)
  f_stat <- mss_b / mss_w
  
  #update the f_vec with this permutation's f_stat
  f_vec[r] = f_stat
  }
  
  #plot and print the results including p-value, 0.95 quantile (our threshold for significance) and the max value
  print(paste("0.95 quantile: ", round(quantile(f_vec, 0.95), 3)))
  print(paste("maximum F-value: ", round(max(f_vec), 3)))
  print(paste("observed F-value: ", anova_f))
  p_value <- mean(f_vec > anova_f)
  print(paste("p-value: ", p_value))
  
  #plot a histogram of the sampling distribution with lines representing the 0.05% significance level and the observed f-value
  hist(f_vec, main = "Empirical sampling distribution of F statistic\nfrom Random Permutation Test", xlab = "F value", breaks = 50, xlim = c(0,anova_f +5), prob = T)
  abline(v = anova_f, col = "red", lty = 2)
  text(anova_f, par("usr")[4] - 0.10, "Observed\nF-value", pos = 4, col = "red")
  abline(v = quantile(f_vec, 0.95), col = "blue", lty = 2)
  text(quantile(f_vec, 0.95), par("usr")[4] - 0.05, "0.95 quantile", pos = 4, col = "blue")
}

#Run function
perm_fun(data = salary_data_tr4_2, R = 1000)
## [1] "0.95 quantile:  2.127"
## [1] "maximum F-value:  4.956"
## [1] "observed F-value:  21.2774541989364"
## [1] "p-value:  0"

A benefit of the random permutation test is that we do not need to make assumptions about the sampling distribution of our test (F) statistic, in order to determine the probability of our observed group differences.

The plotted distribution essentially displays the distribution of the F statistic under a null hypothesis that there is no meaningful difference between salary in different regions. The blue lines represents the 0.95 quantile for the null distribution, which we are using as our significance level, and the red line represents the F value obtained from our sample. The observed F-value is far greater than the 0.95 quantile, and the probability of achieving this value, given the null hypothesis, is extremely small (p-value = 0).


Non-Parametric Bootstrap Test (using un-transformed data):

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

#first define the group counts and means
region_stats <- salary_data_tr4_2 %>%
  group_by(region) %>%
  summarise(n = n(), mean = mean(salary_USD))  

#Create boostrap function
boot_fun <- function(data, R = 1000){

  #'shift' data for each group so that it is 'centered' over 0, by subtracting the mean of each group from its values
  bootstrap_anova_data <- data %>%
    group_by(region) %>%
   mutate(mean_salary = mean(salary_USD),
         mean_diff = salary_USD - mean_salary) %>%
   ungroup() %>%
    select(mean_diff, region)

  bootstrap_anova_data <- as.data.frame(bootstrap_anova_data)
    
  #Set the initial bootstrap conditions and objects  
  f_vec = numeric(R)
  # Create a new vector 'names' by repeating each region 'n' times
  group_names <- rep(region_stats$region, times = region_stats$n)

  #Run the bootstrap sampling
  for(i in 1:R){
  
    #randomly sample with replacement from each group
    groupA = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Northern America", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Northern America", "n"]), replace = TRUE)
    groupB = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Australia and New Zealand", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Australia and New Zealand", "n"]), replace = TRUE)
    groupC = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Eastern Asia", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Eastern Asia", "n"]), replace = TRUE)
    groupD = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Northern Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Northern Europe", "n"]), replace = TRUE)
    groupE = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Southern Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Southern Europe", "n"]), replace = TRUE)
    groupF = sample(bootstrap_anova_data[bootstrap_anova_data$region == "Western Europe", "mean_diff"], size = as.numeric(region_stats[region_stats$region == "Western Europe", "n"]), replace = TRUE)
  
    #collate the data
    samples_collated = c(groupA, groupB, groupC, groupD, groupE, groupF) 
    collated_data = data.frame(samples_collated, group_names)
    f_vec[i] = oneway.test(samples_collated ~ group_names, data = collated_data, var.equal = T)$statistic
  }

  #plot and print the results including p-value, 0.95 quantile (our threshold for significance) and the max value
    print(paste("0.95 quantile: ", round(quantile(f_vec, 0.95), 3)))
    print(paste("maximum F-value: ", round(max(f_vec), 3)))
    print(paste("observed F-value: ", anova_f))
    p_value <- mean(f_vec > anova_f)
    print(paste("p-value: ", p_value))

  #create a histogram of the bootstapped sampling distribution with our F-value obtained from the original ANOVA and the 0.95 quantile
  hist(f_vec, main = "Empirical sampling distribution of F-statistic\nfrom bootstrap ANOVA", xlab = "F value", breaks = 50, xlim = c(0, anova_f + 5), prob = T)
  abline(v = anova_f, col = "red", lty = 2)
  text(anova_f, par("usr")[4] - 0.10, "Observed\nF-value", pos = 4, col = "red")
  abline(v = quantile(f_vec, 0.95), col = "blue", lty = 2)
  text(quantile(f_vec, 0.95), par("usr")[4] - 0.05, "0.95 quantile", pos = 4, col = "blue")
  
}

#run the function
boot_fun(data = salary_data_tr4_2, R = 1000)
## [1] "0.95 quantile:  2.013"
## [1] "maximum F-value:  3.758"
## [1] "observed F-value:  21.2774541989364"
## [1] "p-value:  0"

Like the random permutation test, the bootstrap test does not need to make assumptions about the sampling distribution of the test (F) statistic, in order to determine the probability of our observed group differences.

This test operates by creating a bootsampled F-statistic sampling distribution, where the data was first standardized by subtracting the group mean from each value in that group. The sampling distribution for F represents what we might observe if there were no true differences between groups.

The observed F-value is highly significant again (p-value = 0).


Diagnostics

We can assess the degree of normality and homoscedasticity for the logarithm transformed salary variable.


Normality Tests

#Kolmogorov-Smirinov Test
ks.test(salary_data_tr4_2$salary_USD_log, "pnorm", mean = mean(salary_data_tr4_2$salary_USD_log), sd = sd(salary_data_tr4_2$salary_USD_log))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  salary_data_tr4_2$salary_USD_log
## D = 0.038035, p-value < 2.2e-16
## alternative hypothesis: two-sided
#Jarque-Bera test for normality
jarque.bera.test(salary_data_tr4_2$salary_USD_log)
## 
##  Jarque Bera Test
## 
## data:  salary_data_tr4_2$salary_USD_log
## X-squared = 501.01, df = 2, p-value < 2.2e-16


Equal Variances Tests

#rule of thumb test
var(salary_data_tr4_2$salary_USD_log[which(salary_data_tr4_2$region == "Eastern Asia")]) / var(salary_data_tr4_2$salary_USD_log[which(salary_data_tr4_2$region == "Northern Europe")])
## [1] 1.916818
#Hartley's Test (F-test ratio check of variance)
index_1 <- which(salary_data_tr4_2$region == "Eastern Asia")
index_2 <- which(salary_data_tr4_2$region == "Northern Europe")
var.test(salary_data_tr4_2$salary_USD_log[index_1], salary_data_tr4_2$salary_USD_log[index_2])
## 
##  F test to compare two variances
## 
## data:  salary_data_tr4_2$salary_USD_log[index_1] and salary_data_tr4_2$salary_USD_log[index_2]
## F = 1.9168, num df = 33, denom df = 1327, p-value = 0.002876
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.237821 3.335939
## sample estimates:
## ratio of variances 
##           1.916818
#Bartlett's Test
bartlett.test(salary_USD_log ~ region, data=salary_data_tr4_2)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  salary_USD_log by region
## Bartlett's K-squared = 84.544, df = 5, p-value < 2.2e-16
#Levene Test
leveneTest(salary_USD_log ~ as.factor(region), data = salary_data_tr4_2)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     5  19.527 < 2.2e-16 ***
##       26177                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Diagnostic Plots

#Diagnostic Plots
plot(anova_region2)


The Residuals vs Fitted plot suggests notable improvement in normality and homoscedasticity. The Q-Q Residuals plot indicates some improvement to normality however there is still some discrepancy. Instead of positive skew, as present before the transformation, there appears to be some degree of heavy tails.

Some of the tests for normality and equal variances also suggest some improvement, while some of them indicate the same degree of violation to these assumptions.

The density plots below illustrates somewhat improved normality distributions across regions.

# Calculate the median salary for each region
median_salaries <- salary_data_tr4_2 %>%
  group_by(region) %>%
  summarise(median_salary = median(salary_USD_log))

#plot
ggplot(salary_data_tr4_2, aes(x = salary_USD_log, fill = region, color = region)) + 
  geom_density(alpha = 0.4, linewidth = 0.1) +
  geom_vline(data = median_salaries, aes(xintercept = median_salary), color = "red", linetype = "dashed") +
  facet_wrap(~ region, ncol = 3) + # Separate plots for each level of 'region'
  labs(title = bquote(bold("Salary ") ~ "by" ~ bold("Region") ~ "(with median indicators)"), x = "log(Salary (USD))", y = "Density") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5, size = 16))


Post Hoc Individual Group Comparisons

Pairwise comparisons made to identify significantly different groups. Non-parametric Dunn’s test utilized, with Holm multiple test corrections for p-values.

Output displays statistically significant results only.

#perform Dunn's Test with Holm correction for p-values
post_hoc_table <- dunnTest(salary_USD ~ region, data = salary_data_tr4_2, method = "holm")

#create a refined table with only the significant comparisons and required elements displayed
post_hoc_table_refined <- as.data.frame(post_hoc_table[[2]]) %>%
  select(Comparison, P.adj) %>%
  filter(P.adj < 0.05) %>%
  rename("P-value" = P.adj) %>%
  arrange(desc(`P-value`)) %>%
  mutate(
    `Effect Size` = NA_real_,      
    `Median Difference (US$)` = NA_real_  
  )

#define vectors of salary for specific countries
n_a = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Northern America"]
a_nz = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Australia and New Zealand"]
e_a = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Eastern Asia"]
n_e = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Northern Europe"]
w_e = salary_data_tr4_2$salary_USD[salary_data_tr4_2$region == "Western Europe"]

#Calculate the Probability of Superiority (ps) for each significantly different group comparison and append to current table
post_hoc_table_refined[1,3] <- round(sum(outer(e_a, w_e, ">")) / (length(e_a) * length(w_e)),2)
post_hoc_table_refined[2,3] <- round(sum(outer(e_a, n_a, ">")) / (length(e_a) * length(n_a)),2)
post_hoc_table_refined[3,3] <- round(sum(outer(a_nz, w_e, ">")) / (length(a_nz) * length(w_e)),2)
post_hoc_table_refined[4,3] <- round(sum(outer(a_nz, n_a, ">")) / (length(a_nz) * length(n_a)),2)
post_hoc_table_refined[5,3] <- round(sum(outer(n_e, w_e, ">")) / (length(n_e) * length(w_e)),2)
post_hoc_table_refined[6,3] <- round(sum(outer(n_a, n_e, ">")) / (length(n_a) * length(n_e)),2)

#Calculate the absolute differences in medians of salary_USD between groups
post_hoc_table_refined[1,4] <- median(e_a) - median(w_e)
post_hoc_table_refined[2,4] <- median(e_a) - median(n_a)
post_hoc_table_refined[3,4] <- median(a_nz) - median(w_e)
post_hoc_table_refined[4,4] <- median(a_nz) - median(n_a)
post_hoc_table_refined[5,4] <- median(n_e) - median(w_e)
post_hoc_table_refined[6,4] <- median(n_a) - median(n_e)

#output formatted table
kable(post_hoc_table_refined)
Comparison P-value Effect Size Median Difference (US$)
Eastern Asia - Western Europe 0.0010674 0.29 -29472.5
Eastern Asia - Northern America 0.0010165 0.31 -27960.5
Australia and New Zealand - Western Europe 0.0002662 0.36 -13505.0
Australia and New Zealand - Northern America 0.0000714 0.38 -11993.0
Northern Europe - Western Europe 0.0000071 0.41 -11897.0
Northern America - Northern Europe 0.0000000 0.58 10385.0


Effect sizes help determine the practical significance of the statistically significant differences. The table presents Probability of Superiority (PS) values to quantify this.

PS values from the Dunn test indicate the probability that observations from one group are superior to those from another group, providing a measure of the magnitude of differences between groups. Values closer to the minimum (0) and maximum (1) values indicate larger effect sizes.


Results


Exploratory Analysis

Due to the significant positive skew within each salary distribution, we can refer to the median as a means of central tendency.

Western Asia and Eastern Europe exhibit the highest median salary values of $95,000 (SE = $10,773) and $79,317 (SE = $25,041), respectively. It’s worth noting that both groups have relatively small sample sizes (n = 13 and 8), which is reflected by their large SE values. Western Europe has the next highest median value ($78,512) with a larger sample size (n = 375) and smaller SE ($2,707).

Northern America, the most frequent region overall by a significant margin (n = 24,278—approximately 18x greater than the next most frequent) has the fifth greatest median salary, $77,000 (SE = $426). The bar graphs reveal that it is possible that Western Europe and Northern America actually have higher median salary values than Eastern Europe and Central America, South America & Caribbean due to the uncertainty in their estimates. Southern Asia also has a relatively imprecise estimate of population median (SE = $20,337).

Eastern Asia has the lowest median salary ($49,040, SE = $8,413) followed by South-Eastern Asia ($53,750, SE = $22,110) and Southern Europe ($63,032, SE = $8,041).

The side by side boxplots of salary by region reveal a number of positive outliers remaining in the dataset still. There’s a noticeable trend in that the number of outliers for a region seems proportional to its overall sample size, suggesting all regions may feature a similar distribution shape towards the tails. As a result, we notice the most outliers with Northern America, then Northern Europe, then Western Europe. Western Asia and Central America, South America & Caribbean appear to have negative skew, contrasting the rest of the distributions. This may be a result of their relatively small sample sizes (n = 13 and 21 respectively) not accurately representing the populations as a whole.

The side by side violin plots and density plots display the degree of positive skew present within each region. The density plots illustrate that there is a great deal of overlap between the distributions of each region. Those regions with larger sample sizes—most notably Northern America, Western Europe and Northern Europe—typically display more normal shaped distributions.


Inferential Testing / Sensitivity Analysis

We conducted an analysis of variance (ANOVA) to examine the differences in mean salaries across different global regions. The study aimed to investigate whether there are significant variations in salary levels among various geographic areas.

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

The ANOVA revealed a statistically significant difference in mean salaries among the regions, F(11, 26259) = 10.36, p < 0.000001 (6dp).

The effect size, as measured by eta-squared (η²), was calculated to be 0.004321 (6dp), indicating a small effect of region on salary variability.


We have enough evidence to be concerned about both the assumptions of normality and homoscedasticty being violated.

We performed a sensitivity analysis using alternative statistical methods. First we conducted an analysis of variance (ANOVA) with a log transformed response (and very small sample regions removed). The test investigated whether there are significant variations in logarithm transformed salary levels among various geographic areas.

Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.

The ANOVA revealed a statistically significant difference in mean logarithm of salaries among the regions, F(5, 26177) = 21.28, p < 0.000001 (6dp).

The effect size, as measured by eta-squared (η²), was calculated to be 0.004048 (6dp), indicating a small effect of region on logarithm of salary variability.


The exploratory graphs and diagnostic plots indicate improved satisfaction of the normality and equal variance assumptions from our initial ANOVA. Despite this improvement, some diagnostic tests continue to yield significant violation against normality and equal variances, possibly influenced by the large power persisting in these test. With very unbalanced sample sizes, any amount of heteroscedasticity is of concern.

Therefore we proceed with some alternative tests to take thee considerations into account.


Next we proceed with an Unequal Variances (Welch ANOVA) on our transformed data.

Null Hypothesis: There is no significant difference in mean logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean logarithm of salaries across at least one pair of regions.

Welch’s ANOVA revealed a statistically significant difference in mean logarithm of salaries among the regions, F(5, 176.11) = 25.36, p < 0.000001 (6dp).


Welch’s ANOVA result may still not be reliable due to the equal sample sizes assumption being considerably violated. We proceed with some non-parametric alternatives.

A Kruskal-Wallis test is conducted to examine the differences in median logarithm of salaries across different global regions.

Null Hypothesis: There is no significant difference in median logarithm of salaries across different regions.

Alternative Hypothesis: There is a significant difference in median logarithm of salaries across at least one pair of regions.

The Kruskal-Wallis test revealed a statistically significant difference in median logarithm of salaries among the regions, χ2(5) = 126.64, p < 0.000001 (6dp).


A Random Permutation Test using the un-transformed data is conducted.

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

The random permutation test revealed a statistically significant difference in mean salaries among the regions, Fcrit (alpha = 0.05) = 2.17, F = 21.28, p = 0 (6dp).


A Bootstrap Test using the un-transformed data is conducted.

Null Hypothesis: There is no significant difference in mean salaries across different regions.

Alternative Hypothesis: There is a significant difference in mean salaries across at least one pair of regions.

The bootstrap test revealed a statistically significant difference in mean salaries among the regions, Fcrit (alpha = 0.05) = 2.12, F = 21.28, p = 0 (6dp).


Individual Group Comparisons

For individual group comparisons, we can assess the results of the non-parametric Dunn’s pairwise tests, which is incorporated in our Kruskal-Walis test output. These tests utilise the Holm correction technique, to make adjustments for multiple comparisons. This control the family wise error rate and maintains the overall probability of making a Type I error (rejecting a true null hypothesis) across all pairwise comparisons at the desired alpha-level of 0.05.

Refer to the table above, which displays output for the six significant differences in median salaries between regions. Note that we obtain the same six significant differences when using a parametric alternative - Tukey’s Honest Significant Differences.

The table is ordered in descending order of practical and statistical significance. Probability of Superiority (PS) scores are used as effect sizes here. These represent the probability that any random value from one region will be greater than than any randomly selected value from the compared region. Therefore the effect size is considered larger as the PS deviates further from 0.5.

We observe the greatest effect sizes between Eastern Asia and Western Europe and Eastern Asia and Northern America. It seems that Eastern Asia has a significantly smaller salary rate in comparison to those regions.

Northern America has significantly higher salary rates than Northern Europe, and Australia and New Zealand.

Note that the groups with considerably high sample sizes, are generally the groups with enough power to detect a true difference in median salary.

Overall, the findings from the sensitivity analysis agree with those obtained from the original ANOVA, providing robust evidence of the influence of global region on the response variable, salary. The consistency of results across different statistical methods enhances our confidence in the validity of the conclusions drawn from this study.


Discussion


This analysis aimed to investigate factors associated with salary variability.

Specifically, there was interest in whether the following factors have any correlation with expected salaries:

-   global region
-   education level
-   job experience
-   age
-   gender


Interpretation of Results - Region

Due to the nature of our data, in determining whether there is a significant difference in the salary rates between any global regions, we can rely most heavily on our non-parametric test methods (random permutation test, bootstrap test, Kruskal-Walis test).

Ultimately though, all of these methods, as well as our parametric test methods, provide evidence in support of the alternative hypothesis that there is a significant difference in mean/median salaries across at least one pair of regions.

Going deeper, we can analyse pairwise differences. Northern America and Western Europe appear to be the global regions with the largest reliable median salaries. Note that if we only assess point estimates, Western Asia, Eastern Europe and Central America/South America/Caribbean regions actually have higher sample medians, yet with their much smaller sample sizes, we cannot be confident in these population estimates and this is reflected in their relatively large 95% confidence intervals which extend well below those of Northern America and Western Europe. Both Northern America and Western Europe are the only two regions to have statistically greater median salaries than other regions.

All three of Eastern Asia, Australia/New Zealand and Northern Europe are considered to have significantly lower median salaries than the two front runners. There’s a considerable practical significance with both Australia/New Zealand and Northern Europe having smaller median salary values than Northern America and Western Europe within a range of between US$10,000 and US$15,000. It is well known that USA/Canada and some Western Europe countries like United Kingdom, Germany have strong economies. However, it is interesting to note that these other ‘Western’ regions typically lag behind in regards to expected salaries.

Eastern Asia is coming in between $25,000 and $30,000 lower than the two leading regions-a highly significant result. Especially considering this region consists of some strong economic countries including Japan, and South Korea. It is likely the exchange rate has a considerable effect when making these comparisons in the US dollar currency.

There are a number of other regions with median salaries lower than Northern America and Western Europe (and in fact Australia/NZ and Northern Europe) as well, including Southern Asia, Africa, Southern Europe, and South-Eastern Asia.

However, most of these regions have relatively small sample sizes and were therefore excluded from our analysis. Statistical power was severely compromised for these groups. Observing the data we do have, I would be very inclined to predict that with a more balanced study design, and therefore more consistent sample sizes, we would obtain significant differences for these regions also.


Study Design and Improvements

Some interesting trends were observed and inferred through this analysis. There are, however, some constraints due to the nature of the data collection. It’s likely there is a degree of voluntary-response bias present in the study due to the self-selection nature of the survey.

One consequence of self-selection is very unequal sample sizes across most factor levels. For example, across the region factor, 92% of respondents are from the Northern America region, with the remaining 8% being spread over 11 separate global regions.

As a result, the accuracy of our sample estimates are somewhat unbalanced. Some regions contain very small relative sample sizes, for example South-Eastern Asia (n = 12). This means we have considerably reduced power in our statistical testing and it becomes difficult to determine a true population difference in salary. For those very small groups, we ended up having to discount them entirely from our inferential testing and for those groups remaining, we still needed to proceed with non-parametric statistical techniques that would be robust towards the unbalanced sample sizes.

Another limitation imposed by the survey design is the potential for confounding variables. For example, respondents to the study might over-represent individuals with certain beliefs or traits, such as people who are generally more content with their salary overall, thus skewing our sample estimates.

Some examples of confounding variables include: access to resources, the overall quality of one’s education and general health and well being.

Specific improvements to the study design can be considered, in order to minimize the issues mentioned. Most notably, improving the design to ensure more balanced and adequate sample sizes across factors are obtained. Including questions that extract information from respondents about some of the confounding variables mentioned above could help develop more accurate insights.



Session Info


This section is included for the sake of reproducability.

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Pacific/Auckland
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] FSA_0.9.5          ggstatsplot_0.12.2 car_3.1-2          carData_3.0-5     
##  [5] tseries_0.10-55    knitr_1.45         countrycode_1.5.0  lubridate_1.9.3   
##  [9] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
## [13] readr_2.1.4        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
## [17] tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0       Rmpfr_0.9-5            farver_2.1.1          
##  [4] statsExpressions_1.5.4 fastmap_1.1.1          bayestestR_0.13.2     
##  [7] digest_0.6.33          timechange_0.2.0       lifecycle_1.0.4       
## [10] multcompView_0.1-10    magrittr_2.0.3         compiler_4.3.2        
## [13] rlang_1.1.2            sass_0.4.8             tools_4.3.2           
## [16] utf8_1.2.4             yaml_2.3.8             ggsignif_0.6.4        
## [19] labeling_0.4.3         curl_5.2.0             TTR_0.24.4            
## [22] RColorBrewer_1.1-3     SuppDists_1.1-9.7      abind_1.4-5           
## [25] withr_3.0.0            grid_4.3.2             datawizard_0.10.0     
## [28] fansi_1.0.6            xts_0.13.2             dunn.test_1.3.5       
## [31] colorspace_2.1-0       paletteer_1.6.0        scales_1.3.0          
## [34] MASS_7.3-60            zeallot_0.1.0          insight_0.19.10       
## [37] cli_3.6.2              mvtnorm_1.2-4          rmarkdown_2.25        
## [40] generics_0.1.3         rstudioapi_0.15.0      tzdb_0.4.0            
## [43] parameters_0.21.6      cachem_1.0.8           effectsize_0.8.7      
## [46] vctrs_0.6.5            boot_1.3-30            jsonlite_1.8.8        
## [49] hms_1.1.3              patchwork_1.2.0        ggrepel_0.9.5         
## [52] correlation_0.8.4      BWStest_0.2.3          jquerylib_0.1.4       
## [55] PMCMRplus_1.9.10       quantmod_0.4.25        glue_1.6.2            
## [58] rematch2_2.1.2         stringi_1.8.3          gtable_0.3.4          
## [61] prismatic_1.1.1        quadprog_1.5-8         gmp_0.7-4             
## [64] munsell_0.5.0          pillar_1.9.0           htmltools_0.5.7       
## [67] R6_2.5.1               evaluate_0.23          lattice_0.21-9        
## [70] highr_0.10             memoise_2.0.1          bslib_0.6.1           
## [73] Rcpp_1.0.12            xfun_0.41              kSamples_1.2-10       
## [76] zoo_1.8-12             pkgconfig_2.0.3