Assignment

Task 1: 3-way Anova with/without interaction

Descriptive statistics

ggplot(Salaries, aes(x=salary)) + 
  geom_histogram(bins=10) + 
  facet_grid(sex ~ discipline+rank)

bxp <- ggboxplot(Salaries, y="salary", x="sex", color ="discipline") + facet_wrap(~rank)
bxp

The distributions within each cell look pretty wonky:

xtabs(~ sex + rank + discipline, data = Salaries)
## , , discipline = A
## 
##         rank
## sex      AsstProf AssocProf Prof
##   Female        6         4    8
##   Male         18        22  123
## 
## , , discipline = B
## 
##         rank
## sex      AsstProf AssocProf Prof
##   Female        5         6   10
##   Male         38        32  125

Outlier

Identify outlier in each cell design:

Salaries %>%
  group_by(sex, rank, discipline) %>%
  identify_outliers(salary)
## # A tibble: 18 x 8
##    rank  discipline sex   yrs.since.phd yrs.service salary is.outlier is.extreme
##    <fct> <fct>      <fct>         <int>       <int>  <int> <lgl>      <lgl>     
##  1 Asst~ A          Fema~             7           6  63100 TRUE       FALSE     
##  2 Asso~ A          Fema~            25          22  62884 TRUE       FALSE     
##  3 Asso~ B          Fema~            14           7 109650 TRUE       TRUE      
##  4 Asso~ B          Fema~            12           9  71065 TRUE       TRUE      
##  5 Asst~ A          Male              3           1  63900 TRUE       FALSE     
##  6 Asst~ A          Male              2           0  85000 TRUE       TRUE      
##  7 Asst~ A          Male              8           4  81035 TRUE       FALSE     
##  8 Asso~ A          Male             14           8 100102 TRUE       FALSE     
##  9 Asso~ A          Male              9           7  70000 TRUE       FALSE     
## 10 Asso~ A          Male             11           1 104800 TRUE       FALSE     
## 11 Asso~ A          Male             45          39  70700 TRUE       FALSE     
## 12 Asso~ A          Male             10           1 108413 TRUE       FALSE     
## 13 Asso~ A          Male             11           8 104121 TRUE       FALSE     
## 14 Asso~ B          Male             13          11 126431 TRUE       FALSE     
## 15 Prof  A          Male             29           7 204000 TRUE       FALSE     
## 16 Prof  A          Male             42          18 194800 TRUE       FALSE     
## 17 Prof  A          Male             43          43 205500 TRUE       FALSE     
## 18 Prof  B          Male             38          38 231545 TRUE       FALSE

There are some extreme outliers but for they are in the group which has had small samples I will ignore them for further researching.

normality

Check normality assumption by analyzing the model residuals. QQ plot and Shapiro-Wilk test of normality are used.

Salaries %>%
  group_by(sex, rank, discipline) %>%
  shapiro_test(salary)
## # A tibble: 12 x 6
##    rank      discipline sex    variable statistic        p
##    <fct>     <fct>      <fct>  <chr>        <dbl>    <dbl>
##  1 AsstProf  A          Female salary       0.870 0.226   
##  2 AsstProf  B          Female salary       0.889 0.354   
##  3 AssocProf A          Female salary       0.863 0.269   
##  4 AssocProf B          Female salary       0.635 0.00117 
##  5 Prof      A          Female salary       0.934 0.549   
##  6 Prof      B          Female salary       0.974 0.923   
##  7 AsstProf  A          Male   salary       0.941 0.300   
##  8 AsstProf  B          Male   salary       0.941 0.0458  
##  9 AssocProf A          Male   salary       0.878 0.0113  
## 10 AssocProf B          Male   salary       0.967 0.416   
## 11 Prof      A          Male   salary       0.952 0.000259
## 12 Prof      B          Male   salary       0.978 0.0435

From the normality test, we find most are in normality. We can see it more clearly in plot below, which showed most of them closed to the line, except for there is one in the B assocProf female.

ggqqplot(Salaries, "salary") +
  facet_grid(sex~discipline+rank)

That is also the extreme outlier, we can delete it.

Salaries= Salaries[-317,]

Homogeneity of variance

This can be checked using the Levene’s test:

levene_test(salary~sex*discipline*rank, data = Salaries)
## # A tibble: 1 x 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1    11   384      9.34 6.49e-15
#bartlett.test(salary ~ interaction(sex,discipline,rank), data=Salaries)

The p<0.05 show that there is significant difference between variances across groups.

Actually I should say it doesn’t meet requirement of anova, and if it is one-way ANOVA, we can use Welch one-way ANOVA. But this is 3-way anova.

But we can see in the plot.

model<-lm(salary~sex*discipline*rank,data=Salaries)
plot(model, 1)

In the plot above, there is no evident relationships between residuals and fitted values (the mean of each groups), which is good. So, we can assume the homogeneity of variances.

ANOVA

interaraciton

results1 <- aov(data=Salaries,salary~rank*sex*discipline)
anova(results1)
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## rank                  2 141937501497 70968750748  137.16 < 0.0000000000000002
## sex                   1    672768654   672768654    1.30                 0.25
## discipline            1  18551357960 18551357960   35.86         0.0000000049
## rank:sex              2    193942177    96971088    0.19                 0.83
## rank:discipline       2    569416536   284708268    0.55                 0.58
## sex:discipline        1    621295022   621295022    1.20                 0.27
## rank:sex:discipline   2    250701313   125350656    0.24                 0.78
## Residuals           384 198680773772   517397848                             
##                        
## rank                ***
## sex                    
## discipline          ***
## rank:sex               
## rank:discipline        
## sex:discipline         
## rank:sex:discipline    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results1)

The ANOVA (formula: salary ~ rank * sex * discipline) suggests that:

  • The main effect of rank is statistically significant and large (F(2, 384) = 137.16, p < .001; Eta2 (partial) = 0.42, 95% CI [0.36, 1.00])
  • The main effect of sex is statistically not significant and very small (F(1, 384) = 1.30, p = 0.255; Eta2 (partial) = 3.37e-03, 95% CI [0.00, 1.00])
  • The main effect of discipline is statistically significant and medium (F(1, 384) = 35.86, p < .001; Eta2 (partial) = 0.09, 95% CI [0.05, 1.00])
  • The interaction between rank and sex is statistically not significant and very small (F(2, 384) = 0.19, p = 0.829; Eta2 (partial) = 9.75e-04, 95% CI [0.00, 1.00])
  • The interaction between rank and discipline is statistically not significant and very small (F(2, 384) = 0.55, p = 0.577; Eta2 (partial) = 2.86e-03, 95% CI [0.00, 1.00])
  • The interaction between sex and discipline is statistically not significant and very small (F(1, 384) = 1.20, p = 0.274; Eta2 (partial) = 3.12e-03, 95% CI [0.00, 1.00])
  • The interaction between rank, sex and discipline is statistically not significant and very small (F(2, 384) = 0.24, p = 0.785; Eta2 (partial) = 1.26e-03, 95% CI [0.00, 1.00])

Effect sizes were labelled following Field’s (2013) recommendations.

without interaction

results2 <- aov(data=Salaries,salary~rank+sex+discipline)
anova(results2)
## Analysis of Variance Table
## 
## Response: salary
##             Df       Sum Sq     Mean Sq F value               Pr(>F)    
## rank         2 141937501497 70968750748  138.52 < 0.0000000000000002 ***
## sex          1    672768654   672768654    1.31                 0.25    
## discipline   1  18551357960 18551357960   36.21         0.0000000041 ***
## Residuals  391 200316128820   512317465                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep2<-report(results2)

The ANOVA (formula: salary ~ rank + sex + discipline) suggests that:

  • The main effect of rank is statistically significant and large (F(2, 391) = 138.52, p < .001; Eta2 (partial) = 0.41, 95% CI [0.36, 1.00])
  • The main effect of sex is statistically not significant and very small (F(1, 391) = 1.31, p = 0.253; Eta2 (partial) = 3.35e-03, 95% CI [0.00, 1.00])
  • The main effect of discipline is statistically significant and medium (F(1, 391) = 36.21, p < .001; Eta2 (partial) = 0.08, 95% CI [0.05, 1.00])

Effect sizes were labelled following Field’s (2013) recommendations.

main effects

Salaries %>%
  group_by(discipline) %>%
  anova_test(salary ~ rank, error = model)
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 2 x 8
##   discipline Effect   DFn   DFd     F        p `p<.05`   ges
## * <fct>      <chr>  <dbl> <dbl> <dbl>    <dbl> <chr>   <dbl>
## 1 A          rank       2   384  60.6 1.38e-23 *        0.24
## 2 B          rank       2   384  86.3 1.10e-31 *        0.31
Salaries %>%
  group_by(rank) %>%
  anova_test(salary ~ discipline, error = model)
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 3 x 8
##   rank      Effect       DFn   DFd     F          p `p<.05`   ges
## * <fct>     <chr>      <dbl> <dbl> <dbl>      <dbl> <chr>   <dbl>
## 1 AsstProf  discipline     1   384  3.38 0.067      ""      0.009
## 2 AssocProf discipline     1   384 10.7  0.001      "*"     0.027
## 3 Prof      discipline     1   384 23.2  0.00000207 "*"     0.057

Comparing the result above, we can see that discipline is the main effect.

summary

It has some extreme outliers, but I ignored.

It is not normality from the p value, but from plot it is almost along the line except for one point. And I delete that point in later test. There is significant difference between variances across groups. But from the plot we can assume it meet requirement to do further research, otherwise we should give up it.

The anova result shows that there are significant difference by rank or discipline, but not by sex. And when cosider interaction between them, the p value also greater than 0.05, which cannot reject H0 which there is no significant difference.

Last, the discipline is the main effect.

Task 2: ANCOVA (based on task 1)

For research the influence brought by covariates, I do the ANCOVA which we can reasearch whether there is influence when we take other 2 numeric variables into consideration.

ANCOVA

I did three different ancov test below, one is years since PhD (yrs.since.phd), one is seniority (yrs.service) and both of them be taken into considerarion.

results3 <- aov(data=Salaries,salary~yrs.since.phd+sex*rank*discipline)
anova(results3)
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## yrs.since.phd         1  63088105862 63088105862  121.69 < 0.0000000000000002
## sex                   1   1773508922  1773508922    3.42                0.065
## rank                  2  78036070273 39018035137   75.26 < 0.0000000000000002
## discipline            1  18382587698 18382587698   35.46         0.0000000059
## sex:rank              2    176681223    88340612    0.17                0.843
## sex:discipline        1    603606986   603606986    1.16                0.281
## rank:discipline       2    603414910   301707455    0.58                0.559
## sex:rank:discipline   2    259444572   129722286    0.25                0.779
## Residuals           383 198554336484   518418633                             
##                        
## yrs.since.phd       ***
## sex                 .  
## rank                ***
## discipline          ***
## sex:rank               
## sex:discipline         
## rank:discipline        
## sex:rank:discipline    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results3)

results4 <- aov(data=Salaries,salary~yrs.service+sex*rank*discipline)
anova(results4) 
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## yrs.service           1  40181763067 40181763067   77.60 < 0.0000000000000002
## sex                   1   2336425709  2336425709    4.51                0.034
## rank                  2 101329713696 50664856848   97.84 < 0.0000000000000002
## discipline            1  17634683790 17634683790   34.05          0.000000011
## sex:rank              2    230492307   115246153    0.22                0.801
## sex:discipline        1    638249832   638249832    1.23                0.268
## rank:discipline       2    553694580   276847290    0.53                0.586
## sex:rank:discipline   2    240667999   120333999    0.23                0.793
## Residuals           383 198332065952   517838292                             
##                        
## yrs.service         ***
## sex                 *  
## rank                ***
## discipline          ***
## sex:rank               
## sex:discipline         
## rank:discipline        
## sex:rank:discipline    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results4)

results5 <- aov(data=Salaries,salary~yrs.since.phd+yrs.service+sex*rank*discipline)
anova(results5) 
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## yrs.since.phd         1  63088105862 63088105862  123.19 < 0.0000000000000002
## yrs.service           1   4539434392  4539434392    8.86               0.0031
## sex                   1   2051487611  2051487611    4.01               0.0460
## rank                  2  74794827306 37397413653   73.02 < 0.0000000000000002
## discipline            1  19498065674 19498065674   38.07         0.0000000017
## sex:rank              2    217410121   108705060    0.21               0.8088
## sex:discipline        1    698079521   698079521    1.36               0.2437
## rank:discipline       2    643096813   321548406    0.63               0.5343
## sex:rank:discipline   2    316875972   158437986    0.31               0.7341
## Residuals           382 195630373659   512121397                             
##                        
## yrs.since.phd       ***
## yrs.service         ** 
## sex                 *  
## rank                ***
## discipline          ***
## sex:rank               
## sex:discipline         
## rank:discipline        
## sex:rank:discipline    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results5)

From the tables above we can find that the p value of the covariates are smaller than 0.05, and after add covariates, the p value of sex become smaller than 0.05 which shows that the salary is somehow depend on the covariates.

Post-hoc tests

Salaries %>%
  group_by(rank,discipline) %>%
  anova_test(salary ~ yrs.since.phd+yrs.service+sex)
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## Coefficient covariances computed by hccm()
## # A tibble: 18 x 9
##    rank      discipline Effect         DFn   DFd      F        p `p<.05`     ges
##  * <fct>     <fct>      <chr>        <dbl> <dbl>  <dbl>    <dbl> <chr>     <dbl>
##  1 AsstProf  A          yrs.since.p~     1    20  3.16   9.10e-2 ""      1.36e-1
##  2 AsstProf  A          yrs.service      1    20  3.24   8.70e-2 ""      1.39e-1
##  3 AsstProf  A          sex              1    20  0.009  9.25e-1 ""      4.60e-4
##  4 AsstProf  B          yrs.since.p~     1    39  2.76   1.05e-1 ""      6.60e-2
##  5 AsstProf  B          yrs.service      1    39  7.61   9.00e-3 "*"     1.63e-1
##  6 AsstProf  B          sex              1    39  0.017  8.98e-1 ""      4.30e-4
##  7 AssocProf A          yrs.since.p~     1    22  3      9.70e-2 ""      1.20e-1
##  8 AssocProf A          yrs.service      1    22  4.58   4.40e-2 "*"     1.72e-1
##  9 AssocProf A          sex              1    22  3.94   6.00e-2 ""      1.52e-1
## 10 AssocProf B          yrs.since.p~     1    33  0.044  8.36e-1 ""      1.00e-3
## 11 AssocProf B          yrs.service      1    33  0.606  4.42e-1 ""      1.80e-2
## 12 AssocProf B          sex              1    33  0.396  5.34e-1 ""      1.20e-2
## 13 Prof      A          yrs.since.p~     1   127 13.6    3.27e-4 "*"     9.70e-2
## 14 Prof      A          yrs.service      1   127 17.7    4.74e-5 "*"     1.23e-1
## 15 Prof      A          sex              1   127  3.38   6.80e-2 ""      2.60e-2
## 16 Prof      B          yrs.since.p~     1   131  0.254  6.15e-1 ""      2.00e-3
## 17 Prof      B          yrs.service      1   131  1.71   1.93e-1 ""      1.30e-2
## 18 Prof      B          sex              1   131  0.014  9.07e-1 ""      1.05e-4
pwc <- Salaries %>% 
  group_by(rank,discipline) %>%
  emmeans_test(
    salary ~ sex, covariate = yrs.service,
    p.adjust.method = "bonferroni"
    )
pwc
## # A tibble: 6 x 11
##   rank     discipline term       .y.   group1 group2    df statistic     p p.adj
## * <chr>    <chr>      <chr>      <chr> <chr>  <chr>  <dbl>     <dbl> <dbl> <dbl>
## 1 AssocPr~ A          yrs.servi~ sala~ Female Male     383   -1.03   0.305 0.305
## 2 AssocPr~ B          yrs.servi~ sala~ Female Male     383    0.298  0.766 0.766
## 3 AsstProf A          yrs.servi~ sala~ Female Male     383   -0.124  0.902 0.902
## 4 AsstProf B          yrs.servi~ sala~ Female Male     383   -0.0398 0.968 0.968
## 5 Prof     A          yrs.servi~ sala~ Female Male     383   -1.41   0.159 0.159
## 6 Prof     B          yrs.servi~ sala~ Female Male     383   -0.269  0.788 0.788
## # ... with 1 more variable: p.adj.signif <chr>
pwc <- Salaries %>% 
  group_by(rank,discipline) %>%
  emmeans_test(
    salary ~ sex, covariate = yrs.since.phd,
    p.adjust.method = "bonferroni"
    )
pwc
## # A tibble: 6 x 11
##   rank    discipline term        .y.   group1 group2    df statistic     p p.adj
## * <chr>   <chr>      <chr>       <chr> <chr>  <chr>  <dbl>     <dbl> <dbl> <dbl>
## 1 AssocP~ A          yrs.since.~ sala~ Female Male     383   -1.05   0.295 0.295
## 2 AssocP~ B          yrs.since.~ sala~ Female Male     383    0.319  0.750 0.750
## 3 AsstPr~ A          yrs.since.~ sala~ Female Male     383   -0.118  0.906 0.906
## 4 AsstPr~ B          yrs.since.~ sala~ Female Male     383   -0.0542 0.957 0.957
## 5 Prof    A          yrs.since.~ sala~ Female Male     383   -1.29   0.199 0.199
## 6 Prof    B          yrs.since.~ sala~ Female Male     383   -0.181  0.856 0.856
## # ... with 1 more variable: p.adj.signif <chr>

From the tables we got above we find the covariates truly have somehow influence on the significant difference between groups, but not for all groups. And its influence can be ignored in my opinion.

Summary

the p value of the covariates are smaller than 0.05, and after add covariates, the p value of sex become smaller than 0.05 which shows that the salary is somehow depend on the covariates.

However, from the tables we got in further research we find the covariates truly have somehow influence on the significant difference between groups, but not for all groups. And its influence can be ignored in my opinion.

Task 3: MANOVA

data("Salaries")

normality

Salaries %>%
  group_by(rank) %>%
  shapiro_test(yrs.since.phd)
## # A tibble: 3 x 4
##   rank      variable      statistic             p
##   <fct>     <chr>             <dbl>         <dbl>
## 1 AsstProf  yrs.since.phd     0.936 0.00192      
## 2 AssocProf yrs.since.phd     0.727 0.00000000135
## 3 Prof      yrs.since.phd     0.971 0.0000304
Salaries %>%
  group_by(rank) %>%
  shapiro_test(yrs.service)
## # A tibble: 3 x 4
##   rank      variable    statistic        p
##   <fct>     <chr>           <dbl>    <dbl>
## 1 AsstProf  yrs.service     0.934 1.44e- 3
## 2 AssocProf yrs.service     0.691 2.60e-10
## 3 Prof      yrs.service     0.978 4.71e- 4
ggqqplot(Salaries, "yrs.since.phd") +
  facet_grid(~rank)

ggqqplot(Salaries, "yrs.service") +
  facet_grid(~rank)

It is not normality. But for practice in further research, I igore it. (Or I need delete a lot of variables or perhaps I can do some log(), but not too much samples in AssocProf which we can get in task 1) ### Homogeneity of variance

This can be checked using the Levene’s test:

levene_test(yrs.since.phd~rank, data = Salaries)
## # A tibble: 1 x 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1     2   394      35.2 8.91e-15
levene_test(yrs.service~rank, data = Salaries)
## # A tibble: 1 x 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1     2   394      39.5 2.37e-16

The p<0.05 which is not good,

But we can see in the plot. Somehow it is not bad because the line is almost horizontal.But actually it not meet the requirement of MANOVA

model<-lm(Salaries$yrs.since.phd~rank,data=Salaries)
plot(model, 1)

model<-lm(Salaries$yrs.service~rank,data=Salaries)
plot(model, 1)

MANOVA

I use two different way to show the results.

res.man <- manova(cbind(Salaries$yrs.since.phd, Salaries$yrs.service) ~ Salaries$rank, data = Salaries)
summary(res.man)
##                Df Pillai approx F num Df den Df              Pr(>F)    
## Salaries$rank   2  0.499     65.4      4    788 <0.0000000000000002 ***
## Residuals     394                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(res.man)
##  Response 1 :
##                Df Sum Sq Mean Sq F value              Pr(>F)    
## Salaries$rank   2  32390   16195     191 <0.0000000000000002 ***
## Residuals     394  33376      85                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                Df Sum Sq Mean Sq F value              Pr(>F)    
## Salaries$rank   2  24812   12406     116 <0.0000000000000002 ***
## Residuals     394  42175     107                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary

We can find the p value is smaller than 0.05 which means significant difference between them in different rank.

But actually the data doesn’t meet the requirement of normal or homogeneity of variance.