Assignment
Task 1: 3-way Anova with/without interaction
Descriptive statistics
ggplot(Salaries, aes(x=salary)) +
geom_histogram(bins=10) +
facet_grid(sex ~ discipline+rank)bxp <- ggboxplot(Salaries, y="salary", x="sex", color ="discipline") + facet_wrap(~rank)
bxp The distributions within each cell look pretty wonky:
xtabs(~ sex + rank + discipline, data = Salaries)## , , discipline = A
##
## rank
## sex AsstProf AssocProf Prof
## Female 6 4 8
## Male 18 22 123
##
## , , discipline = B
##
## rank
## sex AsstProf AssocProf Prof
## Female 5 6 10
## Male 38 32 125
Outlier
Identify outlier in each cell design:
Salaries %>%
group_by(sex, rank, discipline) %>%
identify_outliers(salary)## # A tibble: 18 x 8
## rank discipline sex yrs.since.phd yrs.service salary is.outlier is.extreme
## <fct> <fct> <fct> <int> <int> <int> <lgl> <lgl>
## 1 Asst~ A Fema~ 7 6 63100 TRUE FALSE
## 2 Asso~ A Fema~ 25 22 62884 TRUE FALSE
## 3 Asso~ B Fema~ 14 7 109650 TRUE TRUE
## 4 Asso~ B Fema~ 12 9 71065 TRUE TRUE
## 5 Asst~ A Male 3 1 63900 TRUE FALSE
## 6 Asst~ A Male 2 0 85000 TRUE TRUE
## 7 Asst~ A Male 8 4 81035 TRUE FALSE
## 8 Asso~ A Male 14 8 100102 TRUE FALSE
## 9 Asso~ A Male 9 7 70000 TRUE FALSE
## 10 Asso~ A Male 11 1 104800 TRUE FALSE
## 11 Asso~ A Male 45 39 70700 TRUE FALSE
## 12 Asso~ A Male 10 1 108413 TRUE FALSE
## 13 Asso~ A Male 11 8 104121 TRUE FALSE
## 14 Asso~ B Male 13 11 126431 TRUE FALSE
## 15 Prof A Male 29 7 204000 TRUE FALSE
## 16 Prof A Male 42 18 194800 TRUE FALSE
## 17 Prof A Male 43 43 205500 TRUE FALSE
## 18 Prof B Male 38 38 231545 TRUE FALSE
There are some extreme outliers but for they are in the group which has had small samples I will ignore them for further researching.
normality
Check normality assumption by analyzing the model residuals. QQ plot and Shapiro-Wilk test of normality are used.
Salaries %>%
group_by(sex, rank, discipline) %>%
shapiro_test(salary)## # A tibble: 12 x 6
## rank discipline sex variable statistic p
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 AsstProf A Female salary 0.870 0.226
## 2 AsstProf B Female salary 0.889 0.354
## 3 AssocProf A Female salary 0.863 0.269
## 4 AssocProf B Female salary 0.635 0.00117
## 5 Prof A Female salary 0.934 0.549
## 6 Prof B Female salary 0.974 0.923
## 7 AsstProf A Male salary 0.941 0.300
## 8 AsstProf B Male salary 0.941 0.0458
## 9 AssocProf A Male salary 0.878 0.0113
## 10 AssocProf B Male salary 0.967 0.416
## 11 Prof A Male salary 0.952 0.000259
## 12 Prof B Male salary 0.978 0.0435
From the normality test, we find most are in normality. We can see it more clearly in plot below, which showed most of them closed to the line, except for there is one in the B assocProf female.
ggqqplot(Salaries, "salary") +
facet_grid(sex~discipline+rank) That is also the extreme outlier, we can delete it.
Salaries= Salaries[-317,]Homogeneity of variance
This can be checked using the Levene’s test:
levene_test(salary~sex*discipline*rank, data = Salaries)## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 11 384 9.34 6.49e-15
#bartlett.test(salary ~ interaction(sex,discipline,rank), data=Salaries)The p<0.05 show that there is significant difference between variances across groups.
Actually I should say it doesn’t meet requirement of anova, and if it is one-way ANOVA, we can use Welch one-way ANOVA. But this is 3-way anova.
But we can see in the plot.
model<-lm(salary~sex*discipline*rank,data=Salaries)
plot(model, 1) In the plot above, there is no evident relationships between residuals and fitted values (the mean of each groups), which is good. So, we can assume the homogeneity of variances.
ANOVA
interaraciton
results1 <- aov(data=Salaries,salary~rank*sex*discipline)
anova(results1)## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 2 141937501497 70968750748 137.16 < 0.0000000000000002
## sex 1 672768654 672768654 1.30 0.25
## discipline 1 18551357960 18551357960 35.86 0.0000000049
## rank:sex 2 193942177 96971088 0.19 0.83
## rank:discipline 2 569416536 284708268 0.55 0.58
## sex:discipline 1 621295022 621295022 1.20 0.27
## rank:sex:discipline 2 250701313 125350656 0.24 0.78
## Residuals 384 198680773772 517397848
##
## rank ***
## sex
## discipline ***
## rank:sex
## rank:discipline
## sex:discipline
## rank:sex:discipline
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results1)The ANOVA (formula: salary ~ rank * sex * discipline) suggests that:
- The main effect of rank is statistically significant and large (F(2, 384) = 137.16, p < .001; Eta2 (partial) = 0.42, 95% CI [0.36, 1.00])
- The main effect of sex is statistically not significant and very small (F(1, 384) = 1.30, p = 0.255; Eta2 (partial) = 3.37e-03, 95% CI [0.00, 1.00])
- The main effect of discipline is statistically significant and medium (F(1, 384) = 35.86, p < .001; Eta2 (partial) = 0.09, 95% CI [0.05, 1.00])
- The interaction between rank and sex is statistically not significant and very small (F(2, 384) = 0.19, p = 0.829; Eta2 (partial) = 9.75e-04, 95% CI [0.00, 1.00])
- The interaction between rank and discipline is statistically not significant and very small (F(2, 384) = 0.55, p = 0.577; Eta2 (partial) = 2.86e-03, 95% CI [0.00, 1.00])
- The interaction between sex and discipline is statistically not significant and very small (F(1, 384) = 1.20, p = 0.274; Eta2 (partial) = 3.12e-03, 95% CI [0.00, 1.00])
- The interaction between rank, sex and discipline is statistically not significant and very small (F(2, 384) = 0.24, p = 0.785; Eta2 (partial) = 1.26e-03, 95% CI [0.00, 1.00])
Effect sizes were labelled following Field’s (2013) recommendations.
without interaction
results2 <- aov(data=Salaries,salary~rank+sex+discipline)
anova(results2)## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## rank 2 141937501497 70968750748 138.52 < 0.0000000000000002 ***
## sex 1 672768654 672768654 1.31 0.25
## discipline 1 18551357960 18551357960 36.21 0.0000000041 ***
## Residuals 391 200316128820 512317465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep2<-report(results2)The ANOVA (formula: salary ~ rank + sex + discipline) suggests that:
- The main effect of rank is statistically significant and large (F(2, 391) = 138.52, p < .001; Eta2 (partial) = 0.41, 95% CI [0.36, 1.00])
- The main effect of sex is statistically not significant and very small (F(1, 391) = 1.31, p = 0.253; Eta2 (partial) = 3.35e-03, 95% CI [0.00, 1.00])
- The main effect of discipline is statistically significant and medium (F(1, 391) = 36.21, p < .001; Eta2 (partial) = 0.08, 95% CI [0.05, 1.00])
Effect sizes were labelled following Field’s (2013) recommendations.
main effects
Salaries %>%
group_by(discipline) %>%
anova_test(salary ~ rank, error = model)## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 2 x 8
## discipline Effect DFn DFd F p `p<.05` ges
## * <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 A rank 2 384 60.6 1.38e-23 * 0.24
## 2 B rank 2 384 86.3 1.10e-31 * 0.31
Salaries %>%
group_by(rank) %>%
anova_test(salary ~ discipline, error = model)## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 3 x 8
## rank Effect DFn DFd F p `p<.05` ges
## * <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 AsstProf discipline 1 384 3.38 0.067 "" 0.009
## 2 AssocProf discipline 1 384 10.7 0.001 "*" 0.027
## 3 Prof discipline 1 384 23.2 0.00000207 "*" 0.057
Comparing the result above, we can see that discipline is the main effect.
summary
It has some extreme outliers, but I ignored.
It is not normality from the p value, but from plot it is almost along the line except for one point. And I delete that point in later test. There is significant difference between variances across groups. But from the plot we can assume it meet requirement to do further research, otherwise we should give up it.
The anova result shows that there are significant difference by rank or discipline, but not by sex. And when cosider interaction between them, the p value also greater than 0.05, which cannot reject H0 which there is no significant difference.
Last, the discipline is the main effect.
Task 2: ANCOVA (based on task 1)
For research the influence brought by covariates, I do the ANCOVA which we can reasearch whether there is influence when we take other 2 numeric variables into consideration.
ANCOVA
I did three different ancov test below, one is years since PhD (yrs.since.phd), one is seniority (yrs.service) and both of them be taken into considerarion.
results3 <- aov(data=Salaries,salary~yrs.since.phd+sex*rank*discipline)
anova(results3)## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.since.phd 1 63088105862 63088105862 121.69 < 0.0000000000000002
## sex 1 1773508922 1773508922 3.42 0.065
## rank 2 78036070273 39018035137 75.26 < 0.0000000000000002
## discipline 1 18382587698 18382587698 35.46 0.0000000059
## sex:rank 2 176681223 88340612 0.17 0.843
## sex:discipline 1 603606986 603606986 1.16 0.281
## rank:discipline 2 603414910 301707455 0.58 0.559
## sex:rank:discipline 2 259444572 129722286 0.25 0.779
## Residuals 383 198554336484 518418633
##
## yrs.since.phd ***
## sex .
## rank ***
## discipline ***
## sex:rank
## sex:discipline
## rank:discipline
## sex:rank:discipline
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results3)
results4 <- aov(data=Salaries,salary~yrs.service+sex*rank*discipline)
anova(results4) ## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.service 1 40181763067 40181763067 77.60 < 0.0000000000000002
## sex 1 2336425709 2336425709 4.51 0.034
## rank 2 101329713696 50664856848 97.84 < 0.0000000000000002
## discipline 1 17634683790 17634683790 34.05 0.000000011
## sex:rank 2 230492307 115246153 0.22 0.801
## sex:discipline 1 638249832 638249832 1.23 0.268
## rank:discipline 2 553694580 276847290 0.53 0.586
## sex:rank:discipline 2 240667999 120333999 0.23 0.793
## Residuals 383 198332065952 517838292
##
## yrs.service ***
## sex *
## rank ***
## discipline ***
## sex:rank
## sex:discipline
## rank:discipline
## sex:rank:discipline
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results4)
results5 <- aov(data=Salaries,salary~yrs.since.phd+yrs.service+sex*rank*discipline)
anova(results5) ## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.since.phd 1 63088105862 63088105862 123.19 < 0.0000000000000002
## yrs.service 1 4539434392 4539434392 8.86 0.0031
## sex 1 2051487611 2051487611 4.01 0.0460
## rank 2 74794827306 37397413653 73.02 < 0.0000000000000002
## discipline 1 19498065674 19498065674 38.07 0.0000000017
## sex:rank 2 217410121 108705060 0.21 0.8088
## sex:discipline 1 698079521 698079521 1.36 0.2437
## rank:discipline 2 643096813 321548406 0.63 0.5343
## sex:rank:discipline 2 316875972 158437986 0.31 0.7341
## Residuals 382 195630373659 512121397
##
## yrs.since.phd ***
## yrs.service **
## sex *
## rank ***
## discipline ***
## sex:rank
## sex:discipline
## rank:discipline
## sex:rank:discipline
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results5)From the tables above we can find that the p value of the covariates are smaller than 0.05, and after add covariates, the p value of sex become smaller than 0.05 which shows that the salary is somehow depend on the covariates.
Post-hoc tests
Salaries %>%
group_by(rank,discipline) %>%
anova_test(salary ~ yrs.since.phd+yrs.service+sex)## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 18 x 9
## rank discipline Effect DFn DFd F p `p<.05` ges
## * <fct> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 AsstProf A yrs.since.p~ 1 20 3.16 9.10e-2 "" 1.36e-1
## 2 AsstProf A yrs.service 1 20 3.24 8.70e-2 "" 1.39e-1
## 3 AsstProf A sex 1 20 0.009 9.25e-1 "" 4.60e-4
## 4 AsstProf B yrs.since.p~ 1 39 2.76 1.05e-1 "" 6.60e-2
## 5 AsstProf B yrs.service 1 39 7.61 9.00e-3 "*" 1.63e-1
## 6 AsstProf B sex 1 39 0.017 8.98e-1 "" 4.30e-4
## 7 AssocProf A yrs.since.p~ 1 22 3 9.70e-2 "" 1.20e-1
## 8 AssocProf A yrs.service 1 22 4.58 4.40e-2 "*" 1.72e-1
## 9 AssocProf A sex 1 22 3.94 6.00e-2 "" 1.52e-1
## 10 AssocProf B yrs.since.p~ 1 33 0.044 8.36e-1 "" 1.00e-3
## 11 AssocProf B yrs.service 1 33 0.606 4.42e-1 "" 1.80e-2
## 12 AssocProf B sex 1 33 0.396 5.34e-1 "" 1.20e-2
## 13 Prof A yrs.since.p~ 1 127 13.6 3.27e-4 "*" 9.70e-2
## 14 Prof A yrs.service 1 127 17.7 4.74e-5 "*" 1.23e-1
## 15 Prof A sex 1 127 3.38 6.80e-2 "" 2.60e-2
## 16 Prof B yrs.since.p~ 1 131 0.254 6.15e-1 "" 2.00e-3
## 17 Prof B yrs.service 1 131 1.71 1.93e-1 "" 1.30e-2
## 18 Prof B sex 1 131 0.014 9.07e-1 "" 1.05e-4
pwc <- Salaries %>%
group_by(rank,discipline) %>%
emmeans_test(
salary ~ sex, covariate = yrs.service,
p.adjust.method = "bonferroni"
)
pwc## # A tibble: 6 x 11
## rank discipline term .y. group1 group2 df statistic p p.adj
## * <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AssocPr~ A yrs.servi~ sala~ Female Male 383 -1.03 0.305 0.305
## 2 AssocPr~ B yrs.servi~ sala~ Female Male 383 0.298 0.766 0.766
## 3 AsstProf A yrs.servi~ sala~ Female Male 383 -0.124 0.902 0.902
## 4 AsstProf B yrs.servi~ sala~ Female Male 383 -0.0398 0.968 0.968
## 5 Prof A yrs.servi~ sala~ Female Male 383 -1.41 0.159 0.159
## 6 Prof B yrs.servi~ sala~ Female Male 383 -0.269 0.788 0.788
## # ... with 1 more variable: p.adj.signif <chr>
pwc <- Salaries %>%
group_by(rank,discipline) %>%
emmeans_test(
salary ~ sex, covariate = yrs.since.phd,
p.adjust.method = "bonferroni"
)
pwc## # A tibble: 6 x 11
## rank discipline term .y. group1 group2 df statistic p p.adj
## * <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AssocP~ A yrs.since.~ sala~ Female Male 383 -1.05 0.295 0.295
## 2 AssocP~ B yrs.since.~ sala~ Female Male 383 0.319 0.750 0.750
## 3 AsstPr~ A yrs.since.~ sala~ Female Male 383 -0.118 0.906 0.906
## 4 AsstPr~ B yrs.since.~ sala~ Female Male 383 -0.0542 0.957 0.957
## 5 Prof A yrs.since.~ sala~ Female Male 383 -1.29 0.199 0.199
## 6 Prof B yrs.since.~ sala~ Female Male 383 -0.181 0.856 0.856
## # ... with 1 more variable: p.adj.signif <chr>
From the tables we got above we find the covariates truly have somehow influence on the significant difference between groups, but not for all groups. And its influence can be ignored in my opinion.
Summary
the p value of the covariates are smaller than 0.05, and after add covariates, the p value of sex become smaller than 0.05 which shows that the salary is somehow depend on the covariates.
However, from the tables we got in further research we find the covariates truly have somehow influence on the significant difference between groups, but not for all groups. And its influence can be ignored in my opinion.
Task 3: MANOVA
data("Salaries")normality
Salaries %>%
group_by(rank) %>%
shapiro_test(yrs.since.phd)## # A tibble: 3 x 4
## rank variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 AsstProf yrs.since.phd 0.936 0.00192
## 2 AssocProf yrs.since.phd 0.727 0.00000000135
## 3 Prof yrs.since.phd 0.971 0.0000304
Salaries %>%
group_by(rank) %>%
shapiro_test(yrs.service)## # A tibble: 3 x 4
## rank variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 AsstProf yrs.service 0.934 1.44e- 3
## 2 AssocProf yrs.service 0.691 2.60e-10
## 3 Prof yrs.service 0.978 4.71e- 4
ggqqplot(Salaries, "yrs.since.phd") +
facet_grid(~rank)ggqqplot(Salaries, "yrs.service") +
facet_grid(~rank) It is not normality. But for practice in further research, I igore it. (Or I need delete a lot of variables or perhaps I can do some log(), but not too much samples in AssocProf which we can get in task 1) ### Homogeneity of variance
This can be checked using the Levene’s test:
levene_test(yrs.since.phd~rank, data = Salaries)## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 2 394 35.2 8.91e-15
levene_test(yrs.service~rank, data = Salaries)## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 2 394 39.5 2.37e-16
The p<0.05 which is not good,
But we can see in the plot. Somehow it is not bad because the line is almost horizontal.But actually it not meet the requirement of MANOVA
model<-lm(Salaries$yrs.since.phd~rank,data=Salaries)
plot(model, 1)model<-lm(Salaries$yrs.service~rank,data=Salaries)
plot(model, 1)MANOVA
I use two different way to show the results.
res.man <- manova(cbind(Salaries$yrs.since.phd, Salaries$yrs.service) ~ Salaries$rank, data = Salaries)
summary(res.man)## Df Pillai approx F num Df den Df Pr(>F)
## Salaries$rank 2 0.499 65.4 4 788 <0.0000000000000002 ***
## Residuals 394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(res.man)## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## Salaries$rank 2 32390 16195 191 <0.0000000000000002 ***
## Residuals 394 33376 85
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## Salaries$rank 2 24812 12406 116 <0.0000000000000002 ***
## Residuals 394 42175 107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary
We can find the p value is smaller than 0.05 which means significant difference between them in different rank.
But actually the data doesn’t meet the requirement of normal or homogeneity of variance.