Author: Haley Grace Henson
library(readxl)
mydata <- read_xlsx("./Salary_Data.xlsx")
Does gender have an effect on salary?
Gender: Male, Female (factored as 0 = Male, 1 = Female)
Salary: in terms of $ made annually
H0: There is no difference in the mean salary between men and women
H1: There is a difference in the mean salary between men and women
H0: μ(Salary, men) = μ(Salary, women)
H1: μ(Salary, men) ≠ μ(Salary, women)
head(mydata)
## # A tibble: 6 × 6
## Age Gender `Education Level` `Job Title` `Years of Experience` Salary
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 32 Male Bachelor's Software Engineer 5 90000
## 2 28 Female Master's Data Analyst 3 65000
## 3 45 Male PhD Senior Manager 15 150000
## 4 36 Female Bachelor's Sales Associate 7 60000
## 5 52 Male Master's Director 20 200000
## 6 29 Male Bachelor's Marketing Analyst 2 55000
In the entire dataset:
Age: in years
Gender: male or female
Education Level: Bachelor’s, Master’s, PhD
Job Title: Name of Job
Years of Experience: in years
Salary: in terms of $ made annually
Unit of Observation: One Employed Person
Sample Size: 6684 employees.
https://www.kaggle.com/datasets/mohithsairamreddy/salary-data
mydata$GenderF <- factor(mydata$Gender,
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))
mycleandata <- na.omit(mydata)
na.omit() was used in another R class I took at my home university. I clarified how to use it by reading https://rpubs.com/jianiteo/naomit
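The interaction between factor() and na.omit() can be sketched on a small toy data frame. The values below, including the "Other" entry, are made up for illustration and are not taken from the salary data set. The point is that factor() turns any value outside the listed levels into NA, and na.omit() then drops every row with an NA in any column.

```r
# Toy data frame (hypothetical values, not the salary data set)
toy <- data.frame(
  Gender = c("Male", "Female", "Other", "Female", NA),
  Salary = c(90000, 65000, 70000, NA, 55000)
)

# Values outside `levels` (here "Other") are converted to NA by factor()
toy$GenderF <- factor(toy$Gender, levels = c("Male", "Female"))

# na.omit() drops every row that has an NA in any column
toy_clean <- na.omit(toy)

print(toy_clean)
nrow(toy)        # rows before cleaning
nrow(toy_clean)  # rows after cleaning
```

Comparing nrow() before and after cleaning is a quick way to see how many observations the two steps removed together.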
mean(mycleandata$Salary)
## [1] 115307.2
The average salary of an employee is $115,307.2.
median(mycleandata$`Years of Experience`)
## [1] 7
The median years of experience is 7 years. Half of the employees have 7 or fewer years of experience; the other half have more than 7 years of experience.
An independent samples t-test is considered because this data set contains two independent groups, male and female, and each employee is measured only once.
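As a minimal sketch of what an independent samples t-test call looks like in R, the example below uses simulated toy salaries (the means, standard deviations, and group sizes are assumptions chosen for illustration, not values from the data set). It only illustrates the call; whether the test is appropriate depends on the normality check.

```r
# Simulated toy data (assumed parameters, not the salary data set)
set.seed(1)
toy <- data.frame(
  Salary  = c(rnorm(30, mean = 120000, sd = 20000),
              rnorm(30, mean = 105000, sd = 20000)),
  GenderF = factor(rep(c("Male", "Female"), each = 30),
                   levels = c("Male", "Female"))
)

# Welch two-sample t-test: two independent groups, unequal variances allowed
res <- t.test(Salary ~ GenderF,
              data = toy,
              var.equal = FALSE,
              alternative = "two.sided")

res$statistic  # t statistic
res$p.value    # two-sided p-value
```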
library(psych)
describeBy(mycleandata$Salary, g = mycleandata$GenderF)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## X1 1 3671 121395.7 52098.63 120000 121698.9 71164.8 350 250000 249650 0
## kurtosis se
## X1 -1.2 859.87
## ------------------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## X1 1 3013 107889 52723.61 105000 106776.6 66717 500 220000 219500 0.15
## kurtosis se
## X1 -1.13 960.52
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mycleandata, aes(x = Salary)) +
  geom_histogram(binwidth = 200, color = "pink") +
  facet_wrap(~GenderF, ncol = 5) +
  ylab("Frequency")
Hypothesis Test for Men:
H0: Salaries earned by men are normally distributed
H1: Salaries earned by men are NOT normally distributed
Hypothesis Test for Women:
H0: Salaries earned by women are normally distributed
H1: Salaries earned by women are NOT normally distributed
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mycleandata %>%
  group_by(GenderF) %>%
  shapiro_test(Salary)
## # A tibble: 2 × 4
## GenderF variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male Salary 0.956 4.86e-32
## 2 Female Salary 0.957 4.19e-29
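We reject the null hypothesis for men (p < 0.001).
We reject the null hypothesis for women (p < 0.001).
Because salary is not normally distributed in either group, the normality assumption of the independent samples t-test is violated, so the non-parametric Wilcoxon rank sum test is used instead.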
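The decision rule applied to the Shapiro-Wilk results can be sketched in base R on simulated toy data. The distributions and the 0.05 cutoff are assumptions for illustration. One group is drawn from a normal distribution and the other from a strongly skewed one, so the skewed group should fail the normality check.

```r
# Simulated toy data (assumed distributions, not the salary data set)
set.seed(42)
toy <- data.frame(
  Salary  = c(rnorm(200, mean = 120000, sd = 20000),  # roughly normal group
              rexp(200, rate = 1 / 100000)),          # strongly skewed group
  GenderF = factor(rep(c("Male", "Female"), each = 200),
                   levels = c("Male", "Female"))
)

# Run shapiro.test() separately for each group and keep the p-values
p_values <- sapply(split(toy$Salary, toy$GenderF),
                   function(x) shapiro.test(x)$p.value)
print(p_values)

# TRUE means normality is rejected for that group at the assumed 0.05 level
reject_normality <- p_values < 0.05
print(reject_normality)
```

Note that shapiro.test() in base R accepts between 3 and 5000 values per call, which both groups in this report satisfy.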
wilcox.test(mycleandata$Salary ~ mycleandata$GenderF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mycleandata$Salary by mycleandata$GenderF
## W = 6338160, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
H0: The locations of the distributions of salaries earned are the same for men and women
H1: The locations of the distributions of salaries earned are NOT the same for men and women
We reject the null hypothesis at p < 0.001.
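To make the W statistic less of a black box, the sketch below recomputes it by hand on a few made-up salaries (toy values, not the data set). In R, W is the sum of the ranks of the first group minus n1(n1 + 1)/2, which equals the number of (first group, second group) pairs in which the first group's value is larger, with ties counting one half.

```r
# Toy salaries (hypothetical values, no ties)
male   <- c(90000, 150000, 200000, 55000, 120000)
female <- c(65000, 60000, 110000, 95000)

# Rank all observations together, then sum the ranks of the first group
ranks    <- rank(c(male, female))
n_male   <- length(male)
w_manual <- sum(ranks[seq_len(n_male)]) - n_male * (n_male + 1) / 2

# Same statistic from wilcox.test()
w_builtin <- unname(wilcox.test(male, female,
                                paired = FALSE,
                                exact = FALSE,
                                correct = FALSE)$statistic)

print(w_manual)
print(w_builtin)
```

Both values agree, so W can be read as a count of pairwise "wins" for the first group.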
Code Used:
install.packages("effectsize")
library(effectsize)
effectsize(wilcox.test(mycleandata$Salary ~ mycleandata$GenderF, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided"))
The effect size was not calculable for this data set.
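If a rank-based effect size is wanted without an extra package, one option is the rank-biserial correlation, which under one common definition is r = 2W / (n1 * n2) - 1. The sketch below applies that formula to made-up toy salaries. It is an illustration under that assumed definition, not a reproduction of what the effectsize package computes for this report.

```r
# Toy salaries (hypothetical values, not the salary data set)
male   <- c(90000, 150000, 200000, 55000, 120000)
female <- c(65000, 60000, 110000, 95000)

# W from the Wilcoxon rank sum test (first group = male)
w <- unname(wilcox.test(male, female,
                        paired = FALSE,
                        exact = FALSE,
                        correct = FALSE)$statistic)

# Rank-biserial correlation: ranges from -1 to 1, 0 means no location shift
r_rank_biserial <- 2 * w / (length(male) * length(female)) - 1
print(r_rank_biserial)
```

The formula needs only the W statistic and the two group sizes, so it can be evaluated from a printed test result and the group counts.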
Based on the sample data, we find that men and women differ in salary earned (p < 0.001). The size of the difference could not be evaluated with an effect size because of a data-related issue.