R Markdown

3-Way ANOVA

Descriptive statistics

Salaries %>%
  group_by(sex, rank, discipline) %>%
  get_summary_stats(salary, type = "common")
## # A tibble: 12 × 13
##    rank    disci…¹ sex   varia…²     n    min    max median    iqr   mean     sd
##    <fct>   <fct>   <fct> <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 AsstPr… A       Fema… salary      6  63100  78500  73000  4000  7.29e4  5463.
##  2 AsstPr… B       Fema… salary      5  74692  97032  80225 15000  8.42e4  9792.
##  3 AssocP… A       Fema… salary      4  62884  77500  74065  4802. 7.21e4  6403.
##  4 AssocP… B       Fema… salary      6  71065 109650 103872   758. 9.94e4 14086.
##  5 Prof    A       Fema… salary      8  90450 137000 109800 15226. 1.10e5 15095.
##  6 Prof    B       Fema… salary     10 105450 161101 128256 20214. 1.32e5 17504.
##  7 AsstPr… A       Male  salary     18  63900  85000  74000  3086  7.43e4  4580.
##  8 AsstPr… B       Male  salary     38  68404  95079  85408 10906. 8.46e4  6900.
##  9 AssocP… A       Male  salary     22  70000 108413  82350  6835  8.50e4 10612.
## 10 AssocP… B       Male  salary     32  83900 126431 100730 10231  1.02e5  9608.
## 11 Prof    A       Male  salary    123  57800 205500 113278 34290  1.21e5 28505.
## 12 Prof    B       Male  salary    125  67559 231545 132825 35876  1.34e5 26514.
## # … with 2 more variables: se <dbl>, ci <dbl>, and abbreviated variable names
## #   ¹​discipline, ²​variable
ggplot(Salaries, aes(x=salary,fill = rank)) + 
  geom_histogram(bins=15) + 
  facet_grid(sex ~ discipline) + 
  labs(title = "Dependency of salary in relation to other factors",
         x = "Salary",
         y = "Number of observations") +
  theme_light()

ggboxplot(Salaries, x = "discipline", y = "salary", 
  color = "rank", palette = "aas", facet.by = "sex")+
  geom_jitter(aes(color = discipline)) +
  labs(title = "Dependency of salary in relation to other factors",
         x = "discipline",
         y = "Salary") +
  theme_light()

Assumptions

Outliers

outliers<-Salaries %>%
  group_by(sex, rank, discipline) %>%
  identify_outliers(salary)

There are 18 outliers in the dataset out of 397 observations. Moreover, there are 3 extreme ones.

Normality

Salaries %>%
  group_by(discipline, rank, sex) %>%
  shapiro_test(salary)
## # A tibble: 12 × 6
##    rank      discipline sex    variable statistic        p
##    <fct>     <fct>      <fct>  <chr>        <dbl>    <dbl>
##  1 AsstProf  A          Female salary       0.870 0.226   
##  2 AsstProf  A          Male   salary       0.941 0.300   
##  3 AssocProf A          Female salary       0.863 0.269   
##  4 AssocProf A          Male   salary       0.878 0.0113  
##  5 Prof      A          Female salary       0.934 0.549   
##  6 Prof      A          Male   salary       0.952 0.000259
##  7 AsstProf  B          Female salary       0.889 0.354   
##  8 AsstProf  B          Male   salary       0.941 0.0458  
##  9 AssocProf B          Female salary       0.635 0.00117 
## 10 AssocProf B          Male   salary       0.967 0.416   
## 11 Prof      B          Female salary       0.974 0.923   
## 12 Prof      B          Male   salary       0.978 0.0435
ggqqplot(Salaries, "salary", ggtheme = theme_light()) +
  facet_grid(sex + rank ~ discipline)+ 
  labs(title = "Normality of salary  in relation to other factors")

Not every group has passed the normality test, probably because of the extreme outliers. There were 5 groups that didn’t passed, so we need to remove outliers from our dataset.

Remove outliers

Salaries$salary[Salaries$salary%in%outliers$salary] <- NA
Salaries <- na.omit(Salaries)

Normality

Salaries %>%
  group_by(discipline, rank, sex) %>%
  shapiro_test(salary)
## # A tibble: 12 × 6
##    rank      discipline sex    variable statistic       p
##    <fct>     <fct>      <fct>  <chr>        <dbl>   <dbl>
##  1 AsstProf  A          Female salary       0.813 0.103  
##  2 AsstProf  A          Male   salary       0.953 0.581  
##  3 AssocProf A          Female salary       0.976 0.703  
##  4 AssocProf A          Male   salary       0.899 0.0787 
##  5 Prof      A          Female salary       0.934 0.549  
##  6 Prof      A          Male   salary       0.967 0.00457
##  7 AsstProf  B          Female salary       0.889 0.354  
##  8 AsstProf  B          Male   salary       0.941 0.0458 
##  9 AssocProf B          Female salary       0.916 0.514  
## 10 AssocProf B          Male   salary       0.976 0.698  
## 11 Prof      B          Female salary       0.974 0.923  
## 12 Prof      B          Male   salary       0.987 0.268
ggqqplot(Salaries, "salary", ggtheme = theme_light()) +
  facet_grid(sex + rank ~ discipline)+ 
  labs(title = "Normality of salary in relation to other factors")

All of the groups passed the normality test except two, but it is fine :)

Homogeneity of variance

Salaries %>%
  levene_test(salary~sex*discipline*rank)
## # A tibble: 1 × 4
##     df1   df2 statistic        p
##   <int> <int>     <dbl>    <dbl>
## 1    11   366      10.8 2.39e-17
model<-lm(salary~sex*discipline*rank,data=Salaries)
plot(model, 1)

To test homogeneity of variance we used Levene Test. The p-value is < 0.05 (it is almost 0). This means that, there is a significant difference between variances across groups. Therefore, we cannot assume the homogeneity of variances in the different discipline groups. On the plot, there is a significant relationships between residuals and fitted values.

Improving homogenity

x<-Salaries %>%
  group_by(discipline, rank, sex)
Salaries_log <-x
Salaries_log$salary<-log(x$salary)
shapiro_test(Salaries_log, salary)
## # A tibble: 12 × 6
##    rank      discipline sex    variable statistic      p
##    <fct>     <fct>      <fct>  <chr>        <dbl>  <dbl>
##  1 AsstProf  A          Female salary       0.813 0.104 
##  2 AsstProf  A          Male   salary       0.952 0.560 
##  3 AssocProf A          Female salary       0.978 0.717 
##  4 AssocProf A          Male   salary       0.891 0.0587
##  5 Prof      A          Female salary       0.936 0.575 
##  6 Prof      A          Male   salary       0.985 0.212 
##  7 AsstProf  B          Female salary       0.896 0.387 
##  8 AsstProf  B          Male   salary       0.931 0.0218
##  9 AssocProf B          Female salary       0.916 0.517 
## 10 AssocProf B          Male   salary       0.981 0.840 
## 11 Prof      B          Female salary       0.976 0.939 
## 12 Prof      B          Male   salary       0.986 0.236

We have improved the homogenity by applying the logarithm to Salaries dataset and only one group does not pass the normality test.

Anova

results <- aov(data=x,salary ~ sex*rank*discipline)
anova(results)
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## sex                   1   3795659233  3795659233    8.43               0.0039
## rank                  2 120588185909 60294092955  133.91 < 0.0000000000000002
## discipline            1  20151319596 20151319596   44.76       0.000000000084
## sex:rank              2    198319490    99159745    0.22               0.8024
## sex:discipline        1    220911511   220911511    0.49               0.4841
## rank:discipline       2    622220372   311110186    0.69               0.5017
## sex:rank:discipline   2    153372867    76686433    0.17               0.8435
## Residuals           366 164793375995   450255126                             
##                        
## sex                 ** 
## rank                ***
## discipline          ***
## sex:rank               
## sex:discipline         
## rank:discipline        
## sex:rank:discipline    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
rep<-report(results)

We can see that rank as well as discipline very strongly influence the salary, the sex is also a quite important factor. There is no interaction between these three variables (no stars :( ).

Ancova

anova(aov(data = x, salary ~ yrs.since.phd + rank * discipline * sex))
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## yrs.since.phd         1  50298509853 50298509853  111.41 < 0.0000000000000002
## rank                  2  74563223858 37281611929   82.58 < 0.0000000000000002
## discipline            1  19486108432 19486108432   43.16        0.00000000017
## sex                   1    189041166   189041166    0.42                 0.52
## rank:discipline       2    634244017   317122009    0.70                 0.50
## rank:sex              2    152688244    76344122    0.17                 0.84
## discipline:sex        1    254773152   254773152    0.56                 0.45
## rank:discipline:sex   2    152568823    76284412    0.17                 0.84
## Residuals           365 164792207427   451485500                             
##                        
## yrs.since.phd       ***
## rank                ***
## discipline          ***
## sex                    
## rank:discipline        
## rank:sex               
## discipline:sex         
## rank:discipline:sex    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(aov(data = x, salary ~ yrs.service + rank * discipline * sex))
## Analysis of Variance Table
## 
## Response: salary
##                      Df       Sum Sq     Mean Sq F value               Pr(>F)
## yrs.service           1  33334824501 33334824501   74.00  0.00000000000000023
## rank                  2  92013716406 46006858203  102.14 < 0.0000000000000002
## discipline            1  19288510828 19288510828   42.82  0.00000000020329808
## sex                   1    241485520   241485520    0.54                 0.46
## rank:discipline       2    606493538   303246769    0.67                 0.51
## rank:sex              2    186373233    93186616    0.21                 0.81
## discipline:sex        1    280048048   280048048    0.62                 0.43
## rank:discipline:sex   2    158588069    79294035    0.18                 0.84
## Residuals           365 164413324830   450447465                             
##                        
## yrs.service         ***
## rank                ***
## discipline          ***
## sex                    
## rank:discipline        
## rank:sex               
## discipline:sex         
## rank:discipline:sex    
## Residuals              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By performing Ancova, we can see that yrs.since.phd singificantly affects salary. It affects the salary even more than discipline variable.The same situation applies to yrs.service.

MANOVA

model <- lm(cbind(yrs.since.phd, yrs.service) ~ salary, x)
Manova(model, test.statistic = "Pillai")
## 
## Type II MANOVA Tests: Pillai test statistic
##        Df test stat approx F num Df den Df              Pr(>F)    
## salary  1     0.171     38.7      2    375 0.00000000000000055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value is lower than 0,05 - almost 0, so we know that there is a significant difference in years since PhD and seniority of different rank professors