Salaries %>%
group_by(sex, rank, discipline) %>%
get_summary_stats(salary, type = "common")
## # A tibble: 12 × 13
## rank disci…¹ sex varia…² n min max median iqr mean sd
## <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AsstPr… A Fema… salary 6 63100 78500 73000 4000 7.29e4 5463.
## 2 AsstPr… B Fema… salary 5 74692 97032 80225 15000 8.42e4 9792.
## 3 AssocP… A Fema… salary 4 62884 77500 74065 4802. 7.21e4 6403.
## 4 AssocP… B Fema… salary 6 71065 109650 103872 758. 9.94e4 14086.
## 5 Prof A Fema… salary 8 90450 137000 109800 15226. 1.10e5 15095.
## 6 Prof B Fema… salary 10 105450 161101 128256 20214. 1.32e5 17504.
## 7 AsstPr… A Male salary 18 63900 85000 74000 3086 7.43e4 4580.
## 8 AsstPr… B Male salary 38 68404 95079 85408 10906. 8.46e4 6900.
## 9 AssocP… A Male salary 22 70000 108413 82350 6835 8.50e4 10612.
## 10 AssocP… B Male salary 32 83900 126431 100730 10231 1.02e5 9608.
## 11 Prof A Male salary 123 57800 205500 113278 34290 1.21e5 28505.
## 12 Prof B Male salary 125 67559 231545 132825 35876 1.34e5 26514.
## # … with 2 more variables: se <dbl>, ci <dbl>, and abbreviated variable names
## # ¹discipline, ²variable
ggplot(Salaries, aes(x=salary,fill = rank)) +
geom_histogram(bins=15) +
facet_grid(sex ~ discipline) +
labs(title = "Dependency of salary in relation to other factors",
x = "Salary",
y = "Number of observations") +
theme_light()
ggboxplot(Salaries, x = "discipline", y = "salary",
color = "rank", palette = "aas", facet.by = "sex")+
geom_jitter(aes(color = discipline)) +
labs(title = "Dependency of salary in relation to other factors",
x = "discipline",
y = "Salary") +
theme_light()
outliers<-Salaries %>%
group_by(sex, rank, discipline) %>%
identify_outliers(salary)
There are 18 outliers in the dataset out of 397 observations. Moreover, there are 3 extreme ones.
Salaries %>%
group_by(discipline, rank, sex) %>%
shapiro_test(salary)
## # A tibble: 12 × 6
## rank discipline sex variable statistic p
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 AsstProf A Female salary 0.870 0.226
## 2 AsstProf A Male salary 0.941 0.300
## 3 AssocProf A Female salary 0.863 0.269
## 4 AssocProf A Male salary 0.878 0.0113
## 5 Prof A Female salary 0.934 0.549
## 6 Prof A Male salary 0.952 0.000259
## 7 AsstProf B Female salary 0.889 0.354
## 8 AsstProf B Male salary 0.941 0.0458
## 9 AssocProf B Female salary 0.635 0.00117
## 10 AssocProf B Male salary 0.967 0.416
## 11 Prof B Female salary 0.974 0.923
## 12 Prof B Male salary 0.978 0.0435
ggqqplot(Salaries, "salary", ggtheme = theme_light()) +
facet_grid(sex + rank ~ discipline)+
labs(title = "Normality of salary in relation to other factors")
Not every group has passed the normality test, probably because of the extreme outliers. There were 5 groups that didn’t passed, so we need to remove outliers from our dataset.
Salaries$salary[Salaries$salary%in%outliers$salary] <- NA
Salaries <- na.omit(Salaries)
Salaries %>%
group_by(discipline, rank, sex) %>%
shapiro_test(salary)
## # A tibble: 12 × 6
## rank discipline sex variable statistic p
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 AsstProf A Female salary 0.813 0.103
## 2 AsstProf A Male salary 0.953 0.581
## 3 AssocProf A Female salary 0.976 0.703
## 4 AssocProf A Male salary 0.899 0.0787
## 5 Prof A Female salary 0.934 0.549
## 6 Prof A Male salary 0.967 0.00457
## 7 AsstProf B Female salary 0.889 0.354
## 8 AsstProf B Male salary 0.941 0.0458
## 9 AssocProf B Female salary 0.916 0.514
## 10 AssocProf B Male salary 0.976 0.698
## 11 Prof B Female salary 0.974 0.923
## 12 Prof B Male salary 0.987 0.268
ggqqplot(Salaries, "salary", ggtheme = theme_light()) +
facet_grid(sex + rank ~ discipline)+
labs(title = "Normality of salary in relation to other factors")
All of the groups passed the normality test except two, but it is fine
:)
Salaries %>%
levene_test(salary~sex*discipline*rank)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 11 366 10.8 2.39e-17
model<-lm(salary~sex*discipline*rank,data=Salaries)
plot(model, 1)
To test homogeneity of variance we used Levene Test. The p-value is <
0.05 (it is almost 0). This means that, there is a significant
difference between variances across groups. Therefore, we cannot assume
the homogeneity of variances in the different discipline groups. On the
plot, there is a significant relationships between residuals and fitted
values.
x<-Salaries %>%
group_by(discipline, rank, sex)
Salaries_log <-x
Salaries_log$salary<-log(x$salary)
shapiro_test(Salaries_log, salary)
## # A tibble: 12 × 6
## rank discipline sex variable statistic p
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 AsstProf A Female salary 0.813 0.104
## 2 AsstProf A Male salary 0.952 0.560
## 3 AssocProf A Female salary 0.978 0.717
## 4 AssocProf A Male salary 0.891 0.0587
## 5 Prof A Female salary 0.936 0.575
## 6 Prof A Male salary 0.985 0.212
## 7 AsstProf B Female salary 0.896 0.387
## 8 AsstProf B Male salary 0.931 0.0218
## 9 AssocProf B Female salary 0.916 0.517
## 10 AssocProf B Male salary 0.981 0.840
## 11 Prof B Female salary 0.976 0.939
## 12 Prof B Male salary 0.986 0.236
We have improved the homogenity by applying the logarithm to Salaries dataset and only one group does not pass the normality test.
results <- aov(data=x,salary ~ sex*rank*discipline)
anova(results)
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 3795659233 3795659233 8.43 0.0039
## rank 2 120588185909 60294092955 133.91 < 0.0000000000000002
## discipline 1 20151319596 20151319596 44.76 0.000000000084
## sex:rank 2 198319490 99159745 0.22 0.8024
## sex:discipline 1 220911511 220911511 0.49 0.4841
## rank:discipline 2 622220372 311110186 0.69 0.5017
## sex:rank:discipline 2 153372867 76686433 0.17 0.8435
## Residuals 366 164793375995 450255126
##
## sex **
## rank ***
## discipline ***
## sex:rank
## sex:discipline
## rank:discipline
## sex:rank:discipline
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results)
We can see that rank as well as discipline very strongly influence the salary, the sex is also a quite important factor. There is no interaction between these three variables (no stars :( ).
anova(aov(data = x, salary ~ yrs.since.phd + rank * discipline * sex))
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.since.phd 1 50298509853 50298509853 111.41 < 0.0000000000000002
## rank 2 74563223858 37281611929 82.58 < 0.0000000000000002
## discipline 1 19486108432 19486108432 43.16 0.00000000017
## sex 1 189041166 189041166 0.42 0.52
## rank:discipline 2 634244017 317122009 0.70 0.50
## rank:sex 2 152688244 76344122 0.17 0.84
## discipline:sex 1 254773152 254773152 0.56 0.45
## rank:discipline:sex 2 152568823 76284412 0.17 0.84
## Residuals 365 164792207427 451485500
##
## yrs.since.phd ***
## rank ***
## discipline ***
## sex
## rank:discipline
## rank:sex
## discipline:sex
## rank:discipline:sex
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(data = x, salary ~ yrs.service + rank * discipline * sex))
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.service 1 33334824501 33334824501 74.00 0.00000000000000023
## rank 2 92013716406 46006858203 102.14 < 0.0000000000000002
## discipline 1 19288510828 19288510828 42.82 0.00000000020329808
## sex 1 241485520 241485520 0.54 0.46
## rank:discipline 2 606493538 303246769 0.67 0.51
## rank:sex 2 186373233 93186616 0.21 0.81
## discipline:sex 1 280048048 280048048 0.62 0.43
## rank:discipline:sex 2 158588069 79294035 0.18 0.84
## Residuals 365 164413324830 450447465
##
## yrs.service ***
## rank ***
## discipline ***
## sex
## rank:discipline
## rank:sex
## discipline:sex
## rank:discipline:sex
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By performing Ancova, we can see that yrs.since.phd singificantly affects salary. It affects the salary even more than discipline variable.The same situation applies to yrs.service.
model <- lm(cbind(yrs.since.phd, yrs.service) ~ salary, x)
Manova(model, test.statistic = "Pillai")
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## salary 1 0.171 38.7 2 375 0.00000000000000055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value is lower than 0,05 - almost 0, so we know that there is a significant difference in years since PhD and seniority of different rank professors