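library(carData)  # provides the Salaries data set
library(dplyr)    # %>%, filter(), group_by(), summarise()
library(tidyr)    # drop_na()
mydata <- Salaries
mydata <- mydata %>% drop_na()
head(mydata)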
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
A data frame with 397 observations on the following 6 variables: rank, discipline, yrs.since.phd, yrs.service, sex and salary.
Units of observation are Assistant Professors, Associate Professors and Professors at a college in the U.S.
Main goal of the data analysis: to estimate how large the salary differences between male and female faculty members are.
Source: the Salaries data set from the carData package
Hypothesis: On average, there is no difference in salaries between male and female faculty members.
\(H_0: \mu_M = \mu_F\)
\(H_1: \mu_M \neq \mu_F\)
\(\mu_M:\) average salary men earn
\(\mu_F:\) average salary women earn
library(psych)  # describeBy()
describeBy(mydata$salary, mydata$sex)
##
## Descriptive statistics by group
## group: Female
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 39 101002.4 25952.13 103750 99531.06 35229.54 62884 161101 98217 0.42 -0.8 4155.67
## ------------------------------------------------------------------------------------------
## group: Male
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 358 115090.4 30436.93 108043 112748.1 29586.02 57800 231545 173745 0.71 0.15 1608.64
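Since the stated goal is to quantify the salary gap, the raw differences can be read off the table above. The following is a minimal sketch, with the group means and medians copied by hand from the describeBy() output; the variable names are illustrative only and are not used elsewhere in the analysis.

```r
# Values copied from the describeBy() output above
mean_f   <- 101002.4; mean_m   <- 115090.4
median_f <- 103750;   median_m <- 108043

mean_gap   <- mean_m - mean_f      # difference in mean salary
median_gap <- median_m - median_f  # difference in median salary
rel_gap    <- mean_gap / mean_m    # mean gap as a share of the male mean

mean_gap
median_gap
round(100 * rel_gap, 1)
```

The mean gap (about 14088) is considerably larger than the median gap (4293), which is consistent with the stronger right skew in the male group.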
What must we check?
Variable is numeric: OK
Normality:
library(ggplot2)  # ggplot(), geom_histogram()
library(ggpubr)   # ggarrange()
Male <- ggplot(mydata %>% filter(sex == "Male"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "blue", color = "black") +
  theme_classic() +
  labs(title = "Male salaries", x = "Salary")  # labs() takes x =, not xlab =
Female <- ggplot(mydata %>% filter(sex == "Female"), aes(x = salary)) +
  geom_histogram(binwidth = 4000, fill = "red", color = "black") +
  theme_classic() +
  labs(title = "Female salaries", x = "Salary")
ggarrange(Male, Female,
          ncol = 2)
Male salaries do not look normally distributed, whereas female salaries might be. We can check normality within each group with the Shapiro-Wilk test.
\(H_0:\) Salaries are normally distributed
\(H_1:\) Salaries are not normally distributed
library(rstatix)  # shapiro_test()
mydata %>% group_by(sex) %>%
  shapiro_test(salary)
## # A tibble: 2 × 4
## sex variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Female salary 0.947 0.0634
## 2 Male salary 0.959 0.0000000173
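Because the \(p\)-value for Male is below 0.001, we reject the null hypothesis of normality for male salaries at \(p<0.001\). For Female (\(p=0.063\)), normality cannot be rejected at the 5% level.
Because normality is violated in at least one group, we use the non-parametric Wilcoxon rank sum test.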
Hypothesis:
\(H_0:\) distribution locations are the same
\(H_1:\) distribution locations are not the same
wilcox.test(mydata$salary ~ mydata$sex,
paired = FALSE,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$salary by mydata$sex
## W = 5182.5, p-value = 0.008219
## alternative hypothesis: true location shift is not equal to 0
We reject \(H_0\) at \(p<0.01\). In this sample, female faculty members tend to earn lower salaries than male faculty members.
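To see where the \(p\)-value comes from, the normal approximation behind the test can be reproduced by hand. This is only a sketch: the statistic and group sizes are copied from the outputs above, the variable names are illustrative, and the variance formula used here ignores ties.

```r
# Values copied from the outputs above
W   <- 5182.5   # Wilcoxon rank sum statistic (first group: Female)
n_f <- 39       # number of women
n_m <- 358      # number of men

mu_W <- n_f * n_m / 2                           # mean of W under H0
sd_W <- sqrt(n_f * n_m * (n_f + n_m + 1) / 12)  # sd of W under H0, no tie correction
z    <- (W - mu_W) / sd_W                       # standardised statistic
p_approx <- 2 * pnorm(-abs(z))                  # two-sided p-value

z
p_approx
```

The result is close to the reported \(p\)-value of 0.008219. wilcox.test() additionally adjusts the variance for tied salaries, so the two values need not agree exactly.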
We can also check the effect size:
library(effectsize)  # effectsize(), interpret_rank_biserial()
effectsize(wilcox.test(mydata$salary ~ mydata$sex,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.26 | [-0.43, -0.07]
interpret_rank_biserial(0.26)
## [1] "medium"
## (Rules: funder2019)
Effect size is medium.
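The reported rank-biserial correlation can be recovered from the test statistic with the common formula \(r = 2W/(n_1 n_2) - 1\). The sketch below assumes that effectsize uses an equivalent definition; the inputs are copied from the outputs above and the variable names are illustrative.

```r
# Values copied from the outputs above
W   <- 5182.5   # Wilcoxon rank sum statistic (first group: Female)
n_f <- 39       # number of women
n_m <- 358      # number of men

# Rank-biserial correlation: rescales W / (n_f * n_m) from [0, 1] to [-1, 1]
r_rb <- 2 * W / (n_f * n_m) - 1
round(r_rb, 2)
```

This gives -0.26, matching the point estimate reported above. The sign is negative because Female is the first group and tends to have the lower salaries; the interpretation above is based on the magnitude 0.26.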
Conclusion: Based on the sample data, we find that male and female salaries differ (\(p<0.01\)): men earn higher salaries, and the difference between the distributions is medium in size (\(|r|=0.26\)).
mydata1 <- mydata %>% group_by(discipline) %>% dplyr::summarise(num = n())
The sample of 397 included 181 from A (“theoretical” departments) and 216 from B (“applied” departments). Can we conclude that more than half of the people who work at universities work in applied departments?
Assumptions are met: both \(n \pi_0 > 5\) and \(n (1-\pi_0) > 5\), since \(397 \cdot 0.5 = 198.5\).
Hypothesis:
\(H_0: \pi = 0.5\)
\(H_1: \pi > 0.5\)
prop.test(x=216,
n=397,
p=0.5,
correct = FALSE,
alternative = "greater")
##
## 1-sample proportions test without continuity correction
##
## data: 216 out of 397, null probability 0.5
## X-squared = 3.0856, df = 1, p-value = 0.03949
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.5028048 1.0000000
## sample estimates:
## p
## 0.5440806
\(p\)-value = 3.9%, so we can reject \(H_0\) at \(p<0.05\). This means that more than half of faculty members work in applied departments.
The estimated proportion working in applied departments is 54.4%.
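The test above is equivalent to a one-sided \(z\)-test for a proportion, which can be checked by hand. This is a minimal sketch using the counts from the text; the variable names are illustrative only.

```r
x   <- 216   # faculty in applied departments (discipline B)
n   <- 397   # sample size
pi0 <- 0.5   # proportion under H0

p_hat <- x / n                          # sample proportion
se0   <- sqrt(pi0 * (1 - pi0) / n)      # standard error under H0
z     <- (p_hat - pi0) / se0            # test statistic
p_val <- pnorm(z, lower.tail = FALSE)   # one-sided: H1 is pi > 0.5

c(p_hat = p_hat, z = z, X_squared = z^2, p_value = p_val)
```

The squared \(z\) statistic reproduces the X-squared value of 3.0856 from prop.test(), and the one-sided \(p\)-value matches the reported 0.03949.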