3-Way ANOVA

The Salaries dataset from the carData consists of nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members. 1. Perform the 3-way Anova with and w/o interactions. Interpret the results. 2. Can years since doctorate (yrs.since.phd), length of service (yrs.service) be significant as covariates? 3. Is there any significant difference in years since PhD (yrs.since.phd) and seniority (yrs.service) of different rank professors?

Descriptice Statistics

data("Salaries")
summary(Salaries)
##         rank     discipline yrs.since.phd    yrs.service        sex     
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545
ggplot(Salaries, aes(x = yrs.service, y = salary, fill = rank)) +
  geom_boxplot() +
  facet_grid(. ~ sex) +
  labs(title = "Distribution of Salary by Rank, Years of Service and Gender") +
  theme_classic()

Assumptions

Outliers

Salaries %>%
  group_by(sex) %>%
  identify_outliers(salary)
## # A tibble: 3 × 8
##   sex   rank  discipline yrs.since.phd yrs.service salary is.outlier is.extreme
##   <fct> <fct> <fct>              <int>       <int>  <int> <lgl>      <lgl>     
## 1 Male  Prof  B                     38          38 231545 TRUE       FALSE     
## 2 Male  Prof  A                     29           7 204000 TRUE       FALSE     
## 3 Male  Prof  A                     43          43 205500 TRUE       FALSE

We can identify 3 outliers, none of them is an extreme, so we decided to continue with analyse without removing those outliers

Normality

shapiro <- shapiro.test(Salaries$salary)
shapiro
## 
##  Shapiro-Wilk normality test
## 
## data:  Salaries$salary
## W = 0.95988, p-value = 6.076e-09

From the Shapiro-Wilk normality test we can observe that the data is not normal, because our p-value is equal to 0.000000006

Q-Q Plots

ggqqplot(Salaries, "salary")

ggqqplot(Salaries, "salary", color = "sex", pallete = c("#00FFFF", "#C70039"))

ggqqplot(Salaries, "salary") +
  facet_grid(sex ~ rank)

From those Q-Q Plots we can observe that the results of Shapiro-Wilk normality test are correct, and the data is not normal

Create a model for 3-way ANOVA with interactions

model_with <- rlm(salary ~ sex * rank * discipline, data = Salaries)
summary(model_with)
## 
## Call: rlm(formula = salary ~ sex * rank * discipline, data = Salaries)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -64325.4 -11577.9   -639.9  11763.1  99660.6 
## 
## Coefficients:
##                                   Value       Std. Error  t value    
## (Intercept)                        72933.3333   8285.1775      8.8029
## sexMale                             1336.2778   9566.8989      0.1397
## rankAssocProf                       -804.8333  13100.0158     -0.0614
## rankProf                           36139.7342  10960.2596      3.2973
## disciplineB                        11256.4667  12288.9041      0.9160
## sexMale:rankAssocProf              11584.0859  14601.8128      0.7933
## sexMale:rankProf                    6358.9751  12097.8169      0.5256
## sexMale:disciplineB                 -878.9988  13591.8029     -0.0647
## rankAssocProf:disciplineB          17033.5358  17961.8367      0.9483
## rankProf:disciplineB               11146.8426  15610.4704      0.7141
## sexMale:rankAssocProf:disciplineB -10881.9812  19696.1782     -0.5525
## sexMale:rankProf:disciplineB       -6408.1889  16853.7872     -0.3802
## 
## Residual standard error: 17440 on 385 degrees of freedom

Homogenity of variance

levene <- leveneTest(model_with)
levene
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group  11   9.047 2.064e-14 ***
##       385                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is < 0.05, which is significant. This means that, there is a significant difference between variances across groups. Therefore, we cannot assume homogeneity of variances in different groups

Spread-Location Plot

ggplot(data = augment(model_with), aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Spread-Location Plot") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ANOVA

anova_with <- aov(model_with)
summary(anova_with)
##                      Df    Sum Sq   Mean Sq F value   Pr(>F)    
## sex                   1 6.980e+09 6.980e+09  13.460 0.000278 ***
## rank                  2 1.371e+11 6.855e+10 132.185  < 2e-16 ***
## discipline            1 1.828e+10 1.828e+10  35.257 6.46e-09 ***
## sex:rank              2 2.352e+08 1.176e+08   0.227 0.797203    
## sex:discipline        1 4.558e+08 4.558e+08   0.879 0.349069    
## rank:discipline       2 4.748e+08 2.374e+08   0.458 0.632997    
## sex:rank:discipline   2 1.324e+08 6.620e+07   0.128 0.880195    
## Residuals           385 1.996e+11 5.186e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a non-significant 3-way interaction of sex, rank, and discipline on salary, non-significant 2-way interactions of sex, and rank; sex, and discipline; and rank, and discipline. But sex, rank, and discipline each on their own have a significant interatcion with the salary. Because we have non-significant 3-way interaction, so we have to test if there are any significant 2-way interactions from ANOVA output.

Create a 3-way ANOVA without interactions

model_without <- rlm(salary ~ sex + rank + discipline, data = Salaries)
summary(model_without)
## 
## Call: rlm(formula = salary ~ sex + rank + discipline, data = Salaries)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -64307.1 -11866.1   -599.5  12467.4  99678.9 
## 
## Coefficients:
##               Value      Std. Error t value   
## (Intercept)   68367.1896  3992.4247    17.1242
## sexMale        3315.3394  3442.2665     0.9631
## rankAssocProf 13798.4661  3530.3459     3.9085
## rankProf      45166.6173  2793.8664    16.1663
## disciplineB   15016.9835  2046.8952     7.3365
## 
## Residual standard error: 18340 on 392 degrees of freedom

ANOVA

anova_without <- aov(model_without)
summary(anova_without)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## sex           1 6.980e+09 6.980e+09   13.62 0.000256 ***
## rank          2 1.371e+11 6.855e+10  133.72  < 2e-16 ***
## discipline    1 1.828e+10 1.828e+10   35.67 5.25e-09 ***
## Residuals   392 2.009e+11 5.126e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the ANOVA without interactions we got same results, as for the ANOVA with interactions - the sex, rank, discipline on their own are significant when it comes to a salary.

Checking significance of covariates

Creating a model

model_ancova <- rlm(salary ~ sex * rank * discipline + yrs.since.phd + yrs.service, data = Salaries)
summary(model_ancova)
## 
## Call: rlm(formula = salary ~ sex * rank * discipline + yrs.since.phd + 
##     yrs.service, data = Salaries)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -63691.4 -11134.4   -767.7  11097.3 100814.0 
## 
## Coefficients:
##                                   Value       Std. Error  t value    
## (Intercept)                        72104.5041   8334.3492      8.6515
## sexMale                              941.4684   9593.6415      0.0981
## rankAssocProf                      -1381.8494  13224.9710     -0.1045
## rankProf                           32700.8595  11353.4390      2.8803
## disciplineB                        10711.1255  12324.2418      0.8691
## yrs.since.phd                        325.5432    219.3625      1.4840
## yrs.service                         -297.8519    193.0934     -1.5425
## sexMale:rankAssocProf              11523.4316  14638.9841      0.7872
## sexMale:rankProf                    8638.2335  12186.3798      0.7088
## sexMale:disciplineB                   97.6648  13640.9014      0.0072
## rankAssocProf:disciplineB          17364.1048  18022.0699      0.9635
## rankProf:disciplineB               14653.8549  15747.8378      0.9305
## sexMale:rankAssocProf:disciplineB -10986.3153  19750.6521     -0.5563
## sexMale:rankProf:disciplineB      -10168.9527  16988.2174     -0.5986
## 
## Residual standard error: 16510 on 383 degrees of freedom

Spread-Location Plot

ggplot(data = augment(model_ancova), aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Spread-Location Plot") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ANCOVA

ancova <- aov(model_ancova)
summary(ancova)
##                      Df    Sum Sq   Mean Sq F value   Pr(>F)    
## sex                   1 6.980e+09 6.980e+09  13.602 0.000259 ***
## rank                  2 1.371e+11 6.855e+10 133.581  < 2e-16 ***
## discipline            1 1.828e+10 1.828e+10  35.630 5.44e-09 ***
## yrs.since.phd         1 1.185e+08 1.185e+08   0.231 0.631081    
## yrs.service           1 2.710e+09 2.710e+09   5.281 0.022095 *  
## sex:rank              2 2.585e+08 1.292e+08   0.252 0.777482    
## sex:discipline        1 5.339e+08 5.339e+08   1.041 0.308346    
## rank:discipline       2 5.341e+08 2.671e+08   0.520 0.594664    
## sex:rank:discipline   2 2.559e+08 1.280e+08   0.249 0.779430    
## Residuals           383 1.965e+11 5.131e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANCOVA we can observe that the 3-way interaction between sex, rank, and discipline is not significant on the salary, because the p-value is equal to 0.77943. It is also a case for the 2-way interactions between sex-rank, sex-discipline, and rank-discipline, where the p-value is equal to 0.77748, 0.30835, and 0.59466 respectively. The yrs.service p-value is 0.02209, which means that it is a significant factor when it comes to the salary, but not as significant as the sex, rank, or discipline on their own. On the other hand, the yrs.since.phd p-value is 0.63108, which means that it is not a significant factor of value of salary.

Checking significane of difference in years since PhD and seniority of different rank professors

RLM for years since PhD and rank

Robust Linear Model (rlm) to assess if there are any potential significant differences in this variable based on rank.

rank_colors <- c("red", "green", "blue")

boxplot(yrs.since.phd ~ rank, data = Salaries, main = "Years Since PhD by Rank", 
        xlab = "Rank", ylab = "Years Since PhD", col = rank_colors)

model_phd <- rlm(yrs.since.phd ~ rank, data = Salaries)
summary(model_phd)
## 
## Call: rlm(formula = yrs.since.phd ~ rank, data = Salaries)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7577  -4.8558  -0.7577   5.2423  35.1442 
## 
## Coefficients:
##               Value   Std. Error t value
## (Intercept)    5.1045  1.1389     4.4818
## rankAssocProf  8.7513  1.6295     5.3706
## rankProf      22.6533  1.2743    17.7766
## 
## Residual standard error: 7.771 on 394 degrees of freedom

RLM for years if service and rank

Robust Linear Model (rlm) to investigate potential significant variations between years of service across different professor ranks

boxplot(yrs.service ~ rank, data = Salaries, main = "Years of Service by Rank", 
        xlab = "Rank", ylab = "Years of Service", col = rank_colors)
legend("topright", legend = levels(Salaries$rank), fill = rank_colors, title = "Rank")

model_service <- rlm(yrs.service ~ rank, data = Salaries)
summary(model_service)
## 
## Call: rlm(formula = yrs.service ~ rank, data = Salaries)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9424  -3.9424  -0.3731   5.0576  42.8597 
## 
## Coefficients:
##               Value   Std. Error t value
## (Intercept)    2.3731  1.1651     2.0369
## rankAssocProf  7.7671  1.6669     4.6597
## rankProf      19.5693  1.3036    15.0119
## 
## Residual standard error: 7.329 on 394 degrees of freedom
manova_result <- manova(cbind(model_phd$residuals, model_service$residuals) ~ rank, data = Salaries)
summary(manova_result)
##            Df    Pillai approx F num Df den Df Pr(>F)
## rank        2 0.0032234  0.31802      4    788  0.866
## Residuals 394

From the MANOVA we can see that there is no significant difference in years since PhD and seniority of different rank professors, which we conclude from the p-value that is equal to 0.866, and is far from the signicance level value of 0.05