Homework 2

Author: Haley Grace Henson

library(readxl)
mydata <- read_xlsx("./Salary_Data.xlsx")

1 Research question

1.1 Research Question

Does gender have an effect on salary?

1.2 Variables Needed

Gender: Male, Female (factored as 0 = Male, 1 = Female)

Salary: in terms of $ made annually

1.3 Research Hypotheses

H0: There is no difference in the mean salary between men and women.

H1: There is a difference in the mean salary between men and women.

H0: Mu Salary Men = Mu Salary Women

H1: Mu Salary Men ≠ Mu Salary Women

head(mydata)

## # A tibble: 6 × 6
##     Age Gender `Education Level` `Job Title`       `Years of Experience` Salary
##   <dbl> <chr>  <chr>             <chr>                             <dbl>  <dbl>
## 1    32 Male   Bachelor's        Software Engineer                     5  90000
## 2    28 Female Master's          Data Analyst                          3  65000
## 3    45 Male   PhD               Senior Manager                       15 150000
## 4    36 Female Bachelor's        Sales Associate                       7  60000
## 5    52 Male   Master's          Director                             20 200000
## 6    29 Male   Bachelor's        Marketing Analyst                     2  55000

2 Data

2.1 Definition of Variables:

In the entire data set:

Age: in years

Gender: male or female

Education Level: Bachelor’s, Master’s, PhD

Job Title: Name of Job

Years of Experience: in years

Salary: in terms of $ made annually

2.2 Unit of Observation and Sample Size:

Unit of Observation: One Employed Person

Sample Size: 6684 employees (3671 men and 3013 women) after removing rows with missing values.

2.3 Source of the data set:

https://www.kaggle.com/datasets/mohithsairamreddy/salary-data

mydata$GenderF <- factor(mydata$Gender,
                         levels = c("Male","Female"),
                         labels = c("Male","Female"))
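One thing worth checking after this step: factor() turns any value that is not listed in levels into NA, and those rows are then removed by a later na.omit(). A minimal sketch with made-up values (the "Other" category and the vector names are hypothetical, used only for illustration):

```r
# Toy gender column (hypothetical values, not taken from Salary_Data.xlsx)
gender_raw <- c("Male", "Female", "Other", "Female", NA)

# Same factor() call pattern as above: only "Male" and "Female" are valid levels
gender_f <- factor(gender_raw,
                   levels = c("Male", "Female"),
                   labels = c("Male", "Female"))

# Count each level and show how many values ended up as NA
table(gender_f, useNA = "ifany")
```

Running table() with useNA = "ifany" on mydata$GenderF would show whether any rows fall outside the two levels before they are dropped.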

2.4 Basic descriptive statistics:

mycleandata<- na.omit(mydata)

I first used na.omit() in another R course at my home university. I checked how to use it by reading https://rpubs.com/jianiteo/naomit
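As a quick illustration of what na.omit() does, the sketch below uses a small made-up data frame (the values and the names toy, toy_clean and rows_dropped are hypothetical). It removes every row that has an NA in any column, so it is worth counting how many rows are lost:

```r
# Toy data frame (hypothetical values, not the salary data set)
toy <- data.frame(
  Gender = c("Male", "Female", NA, "Female"),
  Salary = c(90000, 65000, 150000, NA)
)

# Keep only the rows with no NA in any column
toy_clean <- na.omit(toy)

# Number of rows removed by the cleaning step
rows_dropped <- nrow(toy) - nrow(toy_clean)

nrow(toy_clean)
rows_dropped
```

The same comparison, nrow(mydata) - nrow(mycleandata), would report how many rows were dropped from the real data.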

mean(mycleandata$Salary)

## [1] 115307.2

The average salary of an employee is $115,307.20.

median(mycleandata$`Years of Experience`)

## [1] 7

The median years of experience is 7 years. Half of the employees have 7 or fewer years of experience, and the other half have more than 7 years of experience.

3 Analysis

3.1 Statistical Test and why:

Independent samples t-test, because the data set contains two independent groups (male and female employees) and each person's salary is measured only once.

3.2 Evaluate all assumptions (1.5 points)

library(psych)
describeBy(mycleandata$Salary, g = mycleandata$GenderF)

## 
##  Descriptive statistics by group 
## group: Male
##    vars    n     mean       sd median  trimmed     mad min    max  range skew
## X1    1 3671 121395.7 52098.63 120000 121698.9 71164.8 350 250000 249650    0
##    kurtosis     se
## X1     -1.2 859.87
## ------------------------------------------------------------ 
## group: Female
##    vars    n   mean       sd median  trimmed   mad min    max  range skew
## X1    1 3013 107889 52723.61 105000 106776.6 66717 500 220000 219500 0.15
##    kurtosis     se
## X1    -1.13 960.52

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

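ggplot(mycleandata, aes(x = Salary)) +
  # Salary ranges up to 250,000, so 10,000-wide bins keep the shape readable
  geom_histogram(binwidth = 10000, color = "pink") +
  # One column stacks the two panels on a shared x-axis for easier comparison
  facet_wrap(~GenderF, ncol = 1) +
  ylab("Frequency")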

Hypothesis Test For Men:

H0: Salaries earned by men are normally distributed

H1: Salaries earned by men are NOT normally distributed

Hypothesis Test for Women:

H0: Salaries earned by women are normally distributed

H1: Salaries earned by women are NOT normally distributed

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mycleandata %>%
  group_by(GenderF) %>%
  shapiro_test(Salary)

## # A tibble: 2 × 4
##   GenderF variable statistic        p
##   <fct>   <chr>        <dbl>    <dbl>
## 1 Male    Salary       0.956 4.86e-32
## 2 Female  Salary       0.957 4.19e-29

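We reject the null hypothesis of normality for men (p < 0.001).

We reject the null hypothesis of normality for women (p < 0.001). Because the normality assumption is violated in both groups, the non-parametric alternative to the independent samples t-test, the Wilcoxon rank sum test, is used.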

3.3 Appropriate statistical test based on the results of the assumption evaluation and its interpretation:

wilcox.test(mycleandata$Salary ~ mycleandata$GenderF,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mycleandata$Salary by mycleandata$GenderF
## W = 6338160, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

H0: The location of the salary distribution is the same for men and women

H1: The location of the salary distribution is NOT the same for men and women

We reject the null hypothesis at p < 0.001.
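To make the W statistic in the output less of a black box, here is a small sketch of how it is built, using made-up numbers (the vectors men and women are hypothetical toy values, not the salary data). W is the rank sum of the first group minus the smallest value that rank sum could take:

```r
# Toy values (hypothetical, chosen without ties to keep the ranks simple)
men   <- c(5, 7, 9)
women <- c(4, 6, 8, 10)

# Rank all observations together, then add up the ranks of the first group
ranks_all    <- rank(c(men, women))
rank_sum_men <- sum(ranks_all[seq_along(men)])

# W = rank sum of group 1 minus its minimum possible value n1 * (n1 + 1) / 2
n1        <- length(men)
w_by_hand <- rank_sum_men - n1 * (n1 + 1) / 2

# The same statistic from the built-in test
w_builtin <- unname(wilcox.test(men, women,
                                exact = FALSE,
                                correct = FALSE)$statistic)

w_by_hand
w_builtin
```

Equivalently, W counts the number of (man, woman) pairs in which the man's value is larger, with ties counted as one half.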

3.4 Calculation of the effect size and its interpretation:

Code Used:

install.packages("effectsize")
library(effectsize)

effectsize(wilcox.test(mycleandata$Salary ~ mycleandata$GenderF,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

The effect size could not be calculated for this data set with the code above.
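As a possible fallback, a rank-biserial correlation can be computed by hand from numbers already reported in this document: the W statistic from the Wilcoxon rank sum output and the two group sizes from describeBy(). This is only a sketch under the standard formula r = 2W / (n1 * n2) - 1, and it is not the output of the effectsize package; the variable names are my own.

```r
# Values copied from the output earlier in this report
w  <- 6338160   # W from the Wilcoxon rank sum test (first group = Male)
n1 <- 3671      # number of men, from describeBy()
n2 <- 3013      # number of women, from describeBy()

# Rank-biserial correlation: share of (man, woman) pairs where the man
# earns more minus the share where the woman earns more
r_rb <- 2 * w / (n1 * n2) - 1

round(r_rb, 3)
```

With these inputs the value comes out at about 0.146. A positive value means men tend to have the higher salary in a randomly chosen pair, and by commonly used rule-of-thumb thresholds a value of this size would usually be read as a small effect.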

4 Conclusion

Based on the sample data, we find that men and women differ in salary earned (p < 0.001), with men having the higher median salary ($120,000 vs. $105,000). The size of this difference could not be evaluated with the effect size calculation because of a data-related issue.