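library(carData)     # Salaries data set
library(tidyverse)   # dplyr (%>%, filter, group_by), tidyr, ggplot2
library(psych)       # describeBy()
library(ggpubr)      # ggarrange()
library(rstatix)     # shapiro_test()
library(effectsize)  # effectsize(), interpret_rank_biserial()
mydata <- Salaries %>% tidyr::drop_na()  # drop rows with missing values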


head(mydata)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

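A data frame with 397 observations on the following 6 variables:

  • rank: academic rank (AsstProf, AssocProf, Prof)

  • discipline: A (“theoretical” departments) or B (“applied” departments)

  • yrs.since.phd: years since PhD

  • yrs.service: years of service

  • sex: Female or Male

  • salary: nine-month salary, in dollars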

Units of observation are Assistant Professors, Associate Professors and Professors in a college in the U.S.

Main goal of the data analysis: to estimate how large the salary differences between male and female faculty members are.

Source: the Salaries data set from the carData R package

HYPOTHESIS TESTING

1) Is there a difference in salaries between female and male faculty?

Hypothesis: On average, there is no difference in salaries between male and female faculty.

\(H_0: \mu_M = \mu_F\)

\(H_1: \mu_M \neq \mu_F\)

\(\mu_M:\) average salary men earn

\(\mu_F:\) average salary women earn

describeBy(mydata$salary, mydata$sex)
## 
##  Descriptive statistics by group 
## group: Female
##    vars  n     mean       sd median  trimmed      mad   min    max range skew kurtosis      se
## X1    1 39 101002.4 25952.13 103750 99531.06 35229.54 62884 161101 98217 0.42     -0.8 4155.67
## ------------------------------------------------------------------------------------------ 
## group: Male
##    vars   n     mean       sd median  trimmed      mad   min    max  range skew kurtosis      se
## X1    1 358 115090.4 30436.93 108043 112748.1 29586.02 57800 231545 173745 0.71     0.15 1608.64

What must we check?

  • Variable is numeric: OK

  • Normality:

Male <- ggplot(mydata %>% filter(sex == "Male"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "blue", color = "black") +
  theme_classic() +
  labs(title = "Male salaries", x = "Salary")  # labs() takes x, not xlab

Female <- ggplot(mydata %>% filter(sex == "Female"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "red", color = "black") +
  theme_classic() +
  labs(title = "Female salaries", x = "Salary")



ggarrange(Male, Female, 
          ncol = 2)

The male salaries do not look normally distributed, whereas the female salaries might be. We can check the normality of both distributions with the Shapiro-Wilk test.

\(H_0:\) Salaries are normally distributed

\(H_1:\) Salaries are not normally distributed

mydata %>% group_by(sex) %>% 
           shapiro_test(salary) 
## # A tibble: 2 × 4
##   sex    variable statistic            p
##   <fct>  <chr>        <dbl>        <dbl>
## 1 Female salary       0.947 0.0634      
## 2 Male   salary       0.959 0.0000000173

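For male salaries the \(p\)-value is far below any conventional significance level, so we reject the null hypothesis of normality for that group (\(p<0.001\)). For female salaries \(p = 0.063\), so normality is not rejected at the 5% level.

Normality is violated in at least one group, so we use the non-parametric Wilcoxon rank-sum test.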

Hypothesis:

\(H_0:\) distribution locations are the same

\(H_1:\) distribution locations are not the same

# Two independent samples; unpaired is the default, so `paired` is not set
wilcox.test(mydata$salary ~ mydata$sex,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$salary by mydata$sex
## W = 5182.5, p-value = 0.008219
## alternative hypothesis: true location shift is not equal to 0

We reject \(H_0\) at \(p<0.01\). Female faculty tend to earn lower salaries than male faculty.

We can also check the effect size:

effectsize(wilcox.test(mydata$salary ~ mydata$sex,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.26             | [-0.43, -0.07]
interpret_rank_biserial(0.26)
## [1] "medium"
## (Rules: funder2019)

The effect size is medium.
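
As a rough check on the reported effect size, the rank-biserial correlation can be recovered from the test statistic \(W\) and the group sizes \(n_F = 39\) and \(n_M = 358\). This is a sketch that assumes the standard relation between \(W\) and \(r\); it is not necessarily the exact computation performed inside effectsize:

\[ r = \frac{2W}{n_F n_M} - 1 = \frac{2 \cdot 5182.5}{39 \cdot 358} - 1 \approx -0.26 \]

The sign is negative because Female is the first group: in about 37% of all female-male pairs the woman has the higher salary (ties counted as one half), compared with the 50% expected under \(H_0\).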

Conclusion: Based on the sample data, we find that male and female salaries differ (\(p<0.01\)): men earn higher salaries, and the difference between the distributions is medium (\(r=0.26\)).

2) Test of population proportion

mydata1 <- mydata %>%  group_by(discipline) %>% dplyr::summarise(num = n())

The sample of 397 included 181 from A (“theoretical” departments) and 216 from B (“applied” departments). Can we conclude that more than half of the people who work at universities work in applied departments?

The assumptions are met: both \(n \pi_0 > 5\) and \(n (1-\pi_0) > 5\) (here \(397 \cdot 0.5 = 198.5\)).

Hypothesis:

\(H_0: \pi = 0.5\)

\(H_1: \pi > 0.5\)

prop.test(x=216,
          n=397,
          p=0.5,
          correct = FALSE,
          alternative = "greater")
## 
##  1-sample proportions test without continuity correction
## 
## data:  216 out of 397, null probability 0.5
## X-squared = 3.0856, df = 1, p-value = 0.03949
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.5028048 1.0000000
## sample estimates:
##         p 
## 0.5440806

\(p\)-value = 3.9%, so we can reject \(H_0\) at \(p<0.05\). We conclude that more than half of the people work in applied departments.
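
The X-squared value reported by prop.test can be reproduced by hand. As a sketch, assuming the standard normal approximation for a one-sample proportion without continuity correction, with \(\hat{p} = 216/397\), \(\pi_0 = 0.5\) and \(n = 397\):

\[ z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} = \frac{0.5441 - 0.5}{\sqrt{0.25 / 397}} \approx 1.757, \qquad z^2 \approx 3.086 \]

The one-sided \(p\)-value is the upper-tail probability \(P(Z > 1.757) \approx 0.039\), which agrees with the output above.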

The estimated proportion of people working in applied departments is 54.4%.