The Salaries dataset from the carData consists of nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the collegeâs administration to monitor salary differences between male and female faculty members. 1. Perform the 3-way Anova with and w/o interactions. Interpret the results. 2. Can years since doctorate (yrs.since.phd), length of service (yrs.service) be significant as covariates? 3. Is there any significant difference in years since PhD (yrs.since.phd) and seniority (yrs.service) of different rank professors?
data("Salaries")
summary(Salaries)
## rank discipline yrs.since.phd yrs.service sex
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
ggplot(Salaries, aes(x = yrs.service, y = salary, fill = rank)) +
geom_boxplot() +
facet_grid(. ~ sex) +
labs(title = "Distribution of Salary by Rank, Years of Service and Gender") +
theme_classic()
Salaries %>%
group_by(sex) %>%
identify_outliers(salary)
## # A tibble: 3 Ă 8
## sex rank discipline yrs.since.phd yrs.service salary is.outlier is.extreme
## <fct> <fct> <fct> <int> <int> <int> <lgl> <lgl>
## 1 Male Prof B 38 38 231545 TRUE FALSE
## 2 Male Prof A 29 7 204000 TRUE FALSE
## 3 Male Prof A 43 43 205500 TRUE FALSE
We can identify 3 outliers, none of them is an extreme, so we decided to continue with analyse without removing those outliers
shapiro <- shapiro.test(Salaries$salary)
shapiro
##
## Shapiro-Wilk normality test
##
## data: Salaries$salary
## W = 0.95988, p-value = 6.076e-09
From the Shapiro-Wilk normality test we can observe that the data is not normal, because our p-value is equal to 0.000000006
ggqqplot(Salaries, "salary")
ggqqplot(Salaries, "salary", color = "sex", pallete = c("#00FFFF", "#C70039"))
ggqqplot(Salaries, "salary") +
facet_grid(sex ~ rank)
From those Q-Q Plots we can observe that the results of Shapiro-Wilk
normality test are correct, and the data is not normal
model_with <- rlm(salary ~ sex * rank * discipline, data = Salaries)
summary(model_with)
##
## Call: rlm(formula = salary ~ sex * rank * discipline, data = Salaries)
## Residuals:
## Min 1Q Median 3Q Max
## -64325.4 -11577.9 -639.9 11763.1 99660.6
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 72933.3333 8285.1775 8.8029
## sexMale 1336.2778 9566.8989 0.1397
## rankAssocProf -804.8333 13100.0158 -0.0614
## rankProf 36139.7342 10960.2596 3.2973
## disciplineB 11256.4667 12288.9041 0.9160
## sexMale:rankAssocProf 11584.0859 14601.8128 0.7933
## sexMale:rankProf 6358.9751 12097.8169 0.5256
## sexMale:disciplineB -878.9988 13591.8029 -0.0647
## rankAssocProf:disciplineB 17033.5358 17961.8367 0.9483
## rankProf:disciplineB 11146.8426 15610.4704 0.7141
## sexMale:rankAssocProf:disciplineB -10881.9812 19696.1782 -0.5525
## sexMale:rankProf:disciplineB -6408.1889 16853.7872 -0.3802
##
## Residual standard error: 17440 on 385 degrees of freedom
levene <- leveneTest(model_with)
levene
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 11 9.047 2.064e-14 ***
## 385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is < 0.05, which is significant. This means that, there is a significant difference between variances across groups. Therefore, we cannot assume homogeneity of variances in different groups
ggplot(data = augment(model_with), aes(.fitted, .resid)) +
geom_point() +
geom_smooth() +
labs(title = "Spread-Location Plot") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
anova_with <- aov(model_with)
summary(anova_with)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 6.980e+09 6.980e+09 13.460 0.000278 ***
## rank 2 1.371e+11 6.855e+10 132.185 < 2e-16 ***
## discipline 1 1.828e+10 1.828e+10 35.257 6.46e-09 ***
## sex:rank 2 2.352e+08 1.176e+08 0.227 0.797203
## sex:discipline 1 4.558e+08 4.558e+08 0.879 0.349069
## rank:discipline 2 4.748e+08 2.374e+08 0.458 0.632997
## sex:rank:discipline 2 1.324e+08 6.620e+07 0.128 0.880195
## Residuals 385 1.996e+11 5.186e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a non-significant 3-way interaction of sex, rank, and discipline on salary, non-significant 2-way interactions of sex, and rank; sex, and discipline; and rank, and discipline. But sex, rank, and discipline each on their own have a significant interatcion with the salary. Because we have non-significant 3-way interaction, so we have to test if there are any significant 2-way interactions from ANOVA output.
model_without <- rlm(salary ~ sex + rank + discipline, data = Salaries)
summary(model_without)
##
## Call: rlm(formula = salary ~ sex + rank + discipline, data = Salaries)
## Residuals:
## Min 1Q Median 3Q Max
## -64307.1 -11866.1 -599.5 12467.4 99678.9
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 68367.1896 3992.4247 17.1242
## sexMale 3315.3394 3442.2665 0.9631
## rankAssocProf 13798.4661 3530.3459 3.9085
## rankProf 45166.6173 2793.8664 16.1663
## disciplineB 15016.9835 2046.8952 7.3365
##
## Residual standard error: 18340 on 392 degrees of freedom
anova_without <- aov(model_without)
summary(anova_without)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 6.980e+09 6.980e+09 13.62 0.000256 ***
## rank 2 1.371e+11 6.855e+10 133.72 < 2e-16 ***
## discipline 1 1.828e+10 1.828e+10 35.67 5.25e-09 ***
## Residuals 392 2.009e+11 5.126e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the ANOVA without interactions we got same results, as for the ANOVA with interactions - the sex, rank, discipline on their own are significant when it comes to a salary.
model_ancova <- rlm(salary ~ sex * rank * discipline + yrs.since.phd + yrs.service, data = Salaries)
summary(model_ancova)
##
## Call: rlm(formula = salary ~ sex * rank * discipline + yrs.since.phd +
## yrs.service, data = Salaries)
## Residuals:
## Min 1Q Median 3Q Max
## -63691.4 -11134.4 -767.7 11097.3 100814.0
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 72104.5041 8334.3492 8.6515
## sexMale 941.4684 9593.6415 0.0981
## rankAssocProf -1381.8494 13224.9710 -0.1045
## rankProf 32700.8595 11353.4390 2.8803
## disciplineB 10711.1255 12324.2418 0.8691
## yrs.since.phd 325.5432 219.3625 1.4840
## yrs.service -297.8519 193.0934 -1.5425
## sexMale:rankAssocProf 11523.4316 14638.9841 0.7872
## sexMale:rankProf 8638.2335 12186.3798 0.7088
## sexMale:disciplineB 97.6648 13640.9014 0.0072
## rankAssocProf:disciplineB 17364.1048 18022.0699 0.9635
## rankProf:disciplineB 14653.8549 15747.8378 0.9305
## sexMale:rankAssocProf:disciplineB -10986.3153 19750.6521 -0.5563
## sexMale:rankProf:disciplineB -10168.9527 16988.2174 -0.5986
##
## Residual standard error: 16510 on 383 degrees of freedom
ggplot(data = augment(model_ancova), aes(.fitted, .resid)) +
geom_point() +
geom_smooth() +
labs(title = "Spread-Location Plot") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ancova <- aov(model_ancova)
summary(ancova)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 6.980e+09 6.980e+09 13.602 0.000259 ***
## rank 2 1.371e+11 6.855e+10 133.581 < 2e-16 ***
## discipline 1 1.828e+10 1.828e+10 35.630 5.44e-09 ***
## yrs.since.phd 1 1.185e+08 1.185e+08 0.231 0.631081
## yrs.service 1 2.710e+09 2.710e+09 5.281 0.022095 *
## sex:rank 2 2.585e+08 1.292e+08 0.252 0.777482
## sex:discipline 1 5.339e+08 5.339e+08 1.041 0.308346
## rank:discipline 2 5.341e+08 2.671e+08 0.520 0.594664
## sex:rank:discipline 2 2.559e+08 1.280e+08 0.249 0.779430
## Residuals 383 1.965e+11 5.131e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANCOVA we can observe that the 3-way interaction between sex, rank, and discipline is not significant on the salary, because the p-value is equal to 0.77943. It is also a case for the 2-way interactions between sex-rank, sex-discipline, and rank-discipline, where the p-value is equal to 0.77748, 0.30835, and 0.59466 respectively. The yrs.service p-value is 0.02209, which means that it is a significant factor when it comes to the salary, but not as significant as the sex, rank, or discipline on their own. On the other hand, the yrs.since.phd p-value is 0.63108, which means that it is not a significant factor of value of salary.
Robust Linear Model (rlm) to assess if there are any potential significant differences in this variable based on rank.
rank_colors <- c("red", "green", "blue")
boxplot(yrs.since.phd ~ rank, data = Salaries, main = "Years Since PhD by Rank",
xlab = "Rank", ylab = "Years Since PhD", col = rank_colors)
model_phd <- rlm(yrs.since.phd ~ rank, data = Salaries)
summary(model_phd)
##
## Call: rlm(formula = yrs.since.phd ~ rank, data = Salaries)
## Residuals:
## Min 1Q Median 3Q Max
## -16.7577 -4.8558 -0.7577 5.2423 35.1442
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 5.1045 1.1389 4.4818
## rankAssocProf 8.7513 1.6295 5.3706
## rankProf 22.6533 1.2743 17.7766
##
## Residual standard error: 7.771 on 394 degrees of freedom
Robust Linear Model (rlm) to investigate potential significant variations between years of service across different professor ranks
boxplot(yrs.service ~ rank, data = Salaries, main = "Years of Service by Rank",
xlab = "Rank", ylab = "Years of Service", col = rank_colors)
legend("topright", legend = levels(Salaries$rank), fill = rank_colors, title = "Rank")
model_service <- rlm(yrs.service ~ rank, data = Salaries)
summary(model_service)
##
## Call: rlm(formula = yrs.service ~ rank, data = Salaries)
## Residuals:
## Min 1Q Median 3Q Max
## -21.9424 -3.9424 -0.3731 5.0576 42.8597
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 2.3731 1.1651 2.0369
## rankAssocProf 7.7671 1.6669 4.6597
## rankProf 19.5693 1.3036 15.0119
##
## Residual standard error: 7.329 on 394 degrees of freedom
manova_result <- manova(cbind(model_phd$residuals, model_service$residuals) ~ rank, data = Salaries)
summary(manova_result)
## Df Pillai approx F num Df den Df Pr(>F)
## rank 2 0.0032234 0.31802 4 788 0.866
## Residuals 394
From the MANOVA we can see that there is no significant difference in years since PhD and seniority of different rank professors, which we conclude from the p-value that is equal to 0.866, and is far from the signicance level value of 0.05