Source: https://www.kaggle.com/datasets/nilimajauhari/glassdoor-analyze-gender-pay-gap
The data set is taken from Glassdoor and records income for various job titles by gender. Because many studies have shown that women are paid less than men for the same job titles, this data set is useful for assessing the size of the gender-based pay gap.
Is there a statistically significant difference in the annual basic pay between male and female employees?
Gender: This is a categorical variable that classifies
employees into two groups: Male and Female.
BasePay: This is a continuous variable representing the
annual basic salary of each employee in dollars.
Null Hypothesis (H0): There is no difference in median annual basic pay between male and female employees.
Alternative Hypothesis (H1): There is a difference in median annual basic pay between male and female employees.
mydata <- read.table("./Gender Pay Gap.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## Gender BasePay
## 1 Female 42363
## 2 Male 108476
## 3 Female 90208
## 4 Male 108080
## 5 Male 99464
## 6 Female 70890
Gender: Male, Female
BasePay: Annual basic pay in $
Unit of observation: One employee
Sample size: 1,000 employees
mydata$GenderF <- factor(mydata$Gender,
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))
This conversion ensures that the Gender data is treated correctly in
statistical tests and visualizations, recognizing Male and
Female as the two distinct categories.
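The order of the factor levels matters later on: with a formula interface, wilcox.test treats the first level as the first group, so W and the sign of the effect size are expressed relative to Male. A minimal sketch of how the conversion can be checked, using a short hypothetical vector in place of mydata$Gender:

```r
# Toy stand-in for mydata$Gender (hypothetical values, not the real data)
gender <- c("Female", "Male", "Female", "Male", "Male")

gender_f <- factor(gender,
                   levels = c("Male", "Female"),
                   labels = c("Male", "Female"))

levels(gender_f)  # "Male" comes first, so it acts as the first group
table(gender_f)   # counts per category; an unexpected NA count would signal a typo in the raw labels
```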
library(psych)
describeBy(mydata$BasePay, group = mydata$GenderF)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## X1 1 532 98457.55 25517.52 98223 97881.3 25539.27 36642 179726 143084 0.22
## kurtosis se
## X1 -0.16 1106.32
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## X1 1 468 89942.82 24378.28 89913.5 89813.96 25229.4 34208 160614 126406 0.06
## kurtosis se
## X1 -0.36 1126.89
Mean Salary: Males have a higher average salary at $98,457.55 compared to females at $89,942.82. This suggests that on average, males earn more than females.
Median Salary: Similarly, the median salary is higher for males ($98,223) than for females ($89,913.5), indicating that the typical male salary is also higher than the typical female salary.
Minimum Salary: The lowest salary is slightly higher for males ($36,642) than for females ($34,208), showing that the base level pay starts lower for females.
Maximum Salary: The highest salary recorded is greater for males ($179,726) than for females ($160,614), which suggests that males also reach higher pay levels than females.
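To put the gap into numbers, the differences can be computed directly from the descriptive statistics reported above. This is plain arithmetic on the printed values; the variable names are chosen here for illustration only.

```r
# Values copied from the describeBy() output above
male_mean     <- 98457.55
female_mean   <- 89942.82
male_median   <- 98223
female_median <- 89913.5

mean_gap   <- male_mean - female_mean      # absolute gap in means, in $
median_gap <- male_median - female_median  # absolute gap in medians, in $

# Gaps expressed as a percentage of the male value
mean_gap_pct   <- 100 * mean_gap / male_mean
median_gap_pct <- 100 * median_gap / male_median

round(c(mean_gap = mean_gap, median_gap = median_gap), 2)
round(c(mean_gap_pct = mean_gap_pct, median_gap_pct = median_gap_pct), 1)
```

In this sample the gap is roughly $8,500 on both measures, or about 8.5% of the male figure.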
The Wilcoxon Rank Sum Test is the appropriate choice. This test is used because:
Data Type: It is suited for comparing two independent samples where
the dependent variable (BasePay) is continuous but not
necessarily normally distributed.
Distribution: It does not require the assumption of normal distribution, which is often not met in salary data due to outliers and skewness.
Independence of Observations: Each observation (one employee's
BasePay) is independent of the others, both within and between the
Gender groups. This is generally assumed to be true, as each
employee's salary should be determined independently.
Variable is numeric: The salary data is numeric and measured in dollars.
Shape of Distributions: Ideally, the two groups should have similar shapes of their distributions. This assumption can be visually checked through histograms.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata, aes(x = BasePay)) +
geom_histogram(binwidth = 10000, colour="gray") +
facet_wrap(~GenderF, ncol = 1) +
ylab("Frequency")
Data Spread: The distribution of BasePay appears approximately normal for both males and females.
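A normal Q-Q plot per group is another common visual check of distribution shape. The sketch below is not part of the original analysis: it uses simulated values drawn from normal distributions with the group sizes, means, and standard deviations reported by describeBy, so the points fall close to the line by construction. With the real data, sim would be replaced by mydata.

```r
set.seed(1)

# Simulated stand-in for mydata (assumption: normal draws using the
# reported n, mean and sd of each group; not the real salaries)
sim <- data.frame(
  GenderF = factor(rep(c("Male", "Female"), times = c(532, 468)),
                   levels = c("Male", "Female")),
  BasePay = c(rnorm(532, mean = 98457.55, sd = 25517.52),
              rnorm(468, mean = 89942.82, sd = 24378.28))
)

# One Q-Q plot per group, side by side
op <- par(mfrow = c(1, 2))
for (g in levels(sim$GenderF)) {
  x <- sim$BasePay[sim$GenderF == g]
  qqnorm(x, main = paste("Normal Q-Q:", g))
  qqline(x)  # reference line through the first and third quartiles
}
par(op)
```

Points that bend away from the line at the ends would indicate heavier or lighter tails than a normal distribution.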
#install.packages("rstatix")
The Shapiro-Wilk test is used to determine whether a sample comes from a normally distributed population.
H0 for male: Base pay for men is normally distributed.
H1 for male: Base pay for men is not normally distributed.
H0 for female: Base pay for women is normally distributed.
H1 for female: Base pay for women is not normally distributed.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(GenderF) %>%
shapiro_test(BasePay)
## # A tibble: 2 × 4
## GenderF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male BasePay 0.994 0.0493
## 2 Female BasePay 0.996 0.246
We cannot reject H0 for females, since the p-value (0.246) is above 5%. For males, the p-value (0.0493) falls marginally below 5%, so H0 is formally rejected at the 5% level. However, the Shapiro-Wilk test is sensitive with large samples (here n = 532) and often rejects normality for slight deviations from a normal distribution that are not practically significant, so we do not treat this borderline result as meaningful evidence against normality. The histogram above shows an approximately normal shape for both groups.
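The sample-size sensitivity of the Shapiro-Wilk test can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original analysis: the population is a mildly right-skewed gamma distribution (shape = 20), and the sample sizes 50 and 500 and the helper reject_rate are chosen here for illustration.

```r
set.seed(42)

# Share of simulated samples for which Shapiro-Wilk rejects normality at 5%.
# The population is the same mildly skewed gamma distribution for every n.
reject_rate <- function(n, reps = 200) {
  p <- replicate(reps, shapiro.test(rgamma(n, shape = 20))$p.value)
  mean(p < 0.05)
}

rates <- c(n50 = reject_rate(50), n500 = reject_rate(500))
rates
```

The deviation from normality is identical in both settings, yet it is flagged far more often at n = 500 than at n = 50, which is why a borderline p-value in a group of several hundred employees should be read together with the plots.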
H0: The distributions of annual basic pay for male and female employees have the same location.
H1: The distributions of annual basic pay for male and female employees do not have the same location.
wilcox.test(mydata$BasePay ~ mydata$GenderF,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$BasePay by mydata$GenderF
## W = 146978, p-value = 8.021e-07
## alternative hypothesis: true location shift is not equal to 0
We reject H0 (p-value < 0.001).
The test has found evidence to suggest that there is a significant difference in the central tendency of the salary distributions between the two genders.
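For readers unfamiliar with the W statistic: it can be obtained by ranking all observations together, summing the ranks of the first group, and subtracting the smallest possible rank sum for that group. A minimal sketch on two small hypothetical samples (not the salary data) shows that this matches what wilcox.test reports:

```r
# Hypothetical toy samples without ties
x <- c(52, 61, 70, 48, 66)  # first group
y <- c(45, 58, 50, 43)      # second group

# Rank the pooled data, then sum the ranks belonging to the first group
r <- rank(c(x, y))
n_x <- length(x)
W_manual <- sum(r[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# The same statistic as reported by wilcox.test()
W_builtin <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

c(manual = W_manual, builtin = W_builtin)
```

Equivalently, W counts the number of (x, y) pairs in which the value from the first group is larger, which is 17 of the 20 pairs here.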
#install.packages("effectsize")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydata$BasePay ~ mydata$GenderF,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.18 | [0.11, 0.25]
interpret_rank_biserial(0.18)
## [1] "small"
## (Rules: funder2019)
The Wilcoxon Rank Sum test indicated that there is a statistically significant difference in salaries between male and female employees. The rank biserial effect size of 0.18 (95% CI [0.11, 0.25]), which is considered small according to the funder2019 rules, means that while there is a difference, it may not be a large one in practical terms. This is consistent with the histograms, in which the two salary distributions overlap substantially.
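The rank biserial value can be reproduced by hand from the reported W and the group sizes using the standard identity r = 2W / (n1 * n2) - 1. This is a sketch of that identity, assumed here to correspond to what effectsize computes for two independent samples; it does reproduce the reported 0.18.

```r
# Values copied from the wilcox.test() and describeBy() output above
W        <- 146978
n_male   <- 532
n_female <- 468

# Share of male-female pairs in which the male salary is higher
# (ties, if any, count as half a pair)
prob_superiority <- W / (n_male * n_female)

# Rank biserial correlation
r_rb <- 2 * prob_superiority - 1

round(c(prob_superiority = prob_superiority, r_rb = r_rb), 2)
```

Read this way, the male employee has the higher BasePay in about 59% of all possible male-female pairs, compared with 50% if the two distributions coincided.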
Research Question: Is there a statistically significant difference in the annual basic pay between male and female employees?
Answer: The Wilcoxon Rank Sum test revealed a statistically significant difference in salaries between male and female employees (p value < 0.001). This finding suggests that the two groups do not have the same distribution of salaries.
Furthermore, the rank biserial correlation effect size was 0.18, which is considered a small effect according to Funder’s guidelines. This means that although the difference is statistically significant, the magnitude of the difference in pay is not large.
In conclusion, the statistical analysis supports the presence of a gender pay gap, indicating that male and female employees have different salary distributions, with the difference being statistically significant but small in effect size. This suggests that on average, there is a disparity in pay between genders, but the practical significance of this disparity, while present, is small.