Report

Daria Skarbek, 184869

Published on 29.01.2022

TASK

The Salaries dataset from the carData consists of nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.

1. Perform the 3-way Anova with and w/o interactions. Interpret the results.
2. Can years since doctorate (yrs.since.phd), length of service (yrs.service) be significant as covariates?
3. Is there any significant difference in years since PhD (yrs.since.phd) and seniority (yrs.service) of different rank professors?

3-WAY ANOVA

In the first point, we want to perform 3-way Anova test. We will start with understanding the data.

Understanding the data

At the beginning we will add an identifier for each observation - it will be useful later, when dealing with outliers.

sal <- Salaries

v <- c(1:nrow(sal))
sal <- cbind(v, sal)

a <- sal %>%
  group_by(rank, discipline, sex) %>%
  get_summary_stats(salary, type = "mean")
Summary of salaries across Professors at the university
rank discipline sex variable n mean
AsstProf A Female salary 6 72933.33
AsstProf A Male salary 18 74269.61
AsstProf B Female salary 5 84189.80
AsstProf B Male salary 38 84647.08
AssocProf A Female salary 4 72128.50
AssocProf A Male salary 22 85048.86
AssocProf B Female salary 6 99435.67
AssocProf B Male salary 32 101621.53
Prof A Female salary 8 109631.88
Prof A Male salary 123 120619.26
Prof B Female salary 10 131836.20
Prof B Male salary 125 133518.36

We create a summary table using get_summary_stats to observe mean values of observations. We can see that the increasing trend in salaries is visible when looking at rank of teachers - Assistant Professors earn less than Professors. There is also a difference in discipline - in each case professors from group B earn more than those in group A. Another trend may be seen in gender of professors - men tend to earn more than women. This will be three factors we will consider when performing 3-way Anova: rank, discipline, sex.
Visualization of those variables is visible on a plot below:

ggplot(data = sal, aes(y = salary, x = discipline, color = sex)) + geom_boxplot() + 
  facet_wrap(~rank) + xlab("") + ylab("Salary") 

Assumptions

To be able to perform 3-way Anova, we must check following assumptions: - not significant outliers
- samples come from populations that are normally distributed - normality check
- homogeneity of variances in groups - check of homoscedasticity

  1. Outliers

We will try to identify outliers in our data.

outliers_data <- sal %>%
  group_by(sex, rank, discipline) %>%
  identify_outliers(salary)

x <- outliers_data %>% filter(is.outlier == 'TRUE')
x <- x[, 4]
x <- as.numeric(unlist(x))
sal1 <- sal[-which(sal$v %in% x),]
Outliers found in the dataset
rank discipline sex v yrs.since.phd yrs.service salary is.outlier is.extreme
AsstProf A Female 238 7 6 63100 TRUE FALSE
AssocProf A Female 124 25 22 62884 TRUE FALSE
AssocProf B Female 219 14 7 109650 TRUE TRUE
AssocProf B Female 317 12 9 71065 TRUE TRUE
AsstProf A Male 227 3 1 63900 TRUE FALSE
AsstProf A Male 288 2 0 85000 TRUE TRUE
AsstProf A Male 397 8 4 81035 TRUE FALSE
AssocProf A Male 141 14 8 100102 TRUE FALSE
AssocProf A Male 228 9 7 70000 TRUE FALSE
AssocProf A Male 294 11 1 104800 TRUE FALSE
AssocProf A Male 300 45 39 70700 TRUE FALSE
AssocProf A Male 368 10 1 108413 TRUE FALSE
AssocProf A Male 380 11 8 104121 TRUE FALSE
AssocProf B Male 323 13 11 126431 TRUE FALSE
Prof A Male 250 29 7 204000 TRUE FALSE
Prof A Male 272 42 18 194800 TRUE FALSE
Prof A Male 365 43 43 205500 TRUE FALSE
Prof B Male 44 38 38 231545 TRUE FALSE

We chose to delete all outliers to make the data easier for later calculations.

  1. Normality

To check normality, we will perform shapiro test and visualize the QQ-plots.

a <- sal1 %>%
  group_by(rank, discipline, sex) %>%
  shapiro_test(salary)
Shapiro Wilk Test
rank discipline sex variable statistic p
AsstProf A Female salary 0.8127783 0.1025736
AsstProf A Male salary 0.9534922 0.5810276
AsstProf B Female salary 0.8892976 0.3535834
AsstProf B Male salary 0.9411888 0.0457877
AssocProf A Female salary 0.9760308 0.7031217
AssocProf A Male salary 0.8994292 0.0786954
AssocProf B Female salary 0.9157855 0.5135847
AssocProf B Male salary 0.9760940 0.6978685
Prof A Female salary 0.9335779 0.5491953
Prof A Male salary 0.9667284 0.0045659
Prof B Female salary 0.9737136 0.9229797
Prof B Male salary 0.9856971 0.2179443
ggqqplot(residuals(lm(salary ~ rank*discipline*sex, data = sal1))) + theme_minimal() + 
  ggtitle("Residuals for checking normality assumption")

In Shapiro-Wilk test we observe two groups where normality assumption is not met. The QQ plot also does not allow us to see perfectly normal distribution.

  1. Homogeneity of variances

To check homogeneity of variance we use Levene’s Test.

a <- leveneTest(salary ~ sex * rank * discipline, data = sal1)
Levene’s Test for homogeneity of variance
Df F value Pr(>F)
group 11 10.91118 0
367 NA NA

As we can observe, we cannot accept two main assumptions - normality and homogeneity of variance. P-value from Levene’s Test is basically 0. Hence, we will try making some transformations to the data.

Data transformation

Firstly, we will try to change the data by taking log of salaries. We want to try to find a solution when a 3-way ANOVA can be performed.

Transformed data will again go through assumptions checking.

a <- sal22 %>%
  group_by(rank, discipline, sex) %>%
  shapiro_test(salary)
Shapiro Test for transformed data
rank discipline sex variable statistic p
AsstProf A Female salary 0.8133727 0.1036839
AsstProf A Male salary 0.9153652 0.2498065
AsstProf B Female salary 0.8957397 0.3867889
AsstProf B Male salary 0.9311392 0.0217904
AssocProf A Female salary 0.9782176 0.7170927
AssocProf A Male salary 0.9326272 0.4742249
AssocProf B Female salary 0.9163571 0.5167221
AssocProf B Male salary 0.9810190 0.8404347
Prof A Female salary 0.9363242 0.5753139
Prof A Male salary 0.9828775 0.1351391
Prof B Female salary 0.9758636 0.9392650
Prof B Male salary 0.9832500 0.1318367
ggqqplot(residuals(lm(salary ~ rank*discipline*sex, data = sal22))) + theme_minimal() + 
  ggtitle("Residuals for checking normality assumption")

a <- leveneTest(data = sal22, salary ~ rank * discipline * sex)
Levene’s Test for transformed data
Df F value Pr(>F)
group 11 9.418411 0
356 NA NA

By taking log of salaries we can observe that most of connections rank * discipline * sex turns out to have normal distribution. The only problem occurs with one subgroup. However, we did not manage to increase p-value for homogeneity of variance to be able to accept this assumption.

Hence, we will perform the calculations in two ways. First, we will use log(salaries) to perform a 3-way ANOVA as was asked, even though assumptions are not met. We will assume small sample with not normal distribution does not reflects real population parameters. In second part, we will use non parametric test.

3-way Anova

First, performing 3-way Anova with interactions. We will begin by stating null hypothesis:

  • H0A : Factor A = rank has no effect on the salary of employees in the university.

  • H0B : Factor B = discipline the employees work in has no effect on the salary.

  • H0C : Factor C = sex, gender of employees has no effect on the salary amount.

  • H0AxB : There is no interaction between rank and discipline that has an affect on the salary they get.

  • H0AxC : There is no interaction between rank and sex of employees that has an affect on the salary they get.

  • H0BxC : There is no interaction between discipline and sex of employees performed on participants that has an affect on the salary they get.

  • H0AxBxC : There is no interaction between rank, discipline and sex that has an affect on the salary the employees get.

Now we can create an Anova model and check the null hypothesis.

model1 <- aov(data = sal22, salary ~ rank * discipline * sex)
a <- anova(model1)
3-way Anova with interactions
Df Sum Sq Mean Sq F value Pr(>F)
rank 2 9.9298550 4.9649275 176.6799363 0.0000000
discipline 1 1.5383121 1.5383121 54.7417639 0.0000000
sex 1 0.0165964 0.0165964 0.5905943 0.4426989
rank:discipline 2 0.0849683 0.0424842 1.5118241 0.2219196
rank:sex 2 0.0060102 0.0030051 0.1069385 0.8986097
discipline:sex 1 0.0234150 0.0234150 0.8332386 0.3619555
rank:discipline:sex 2 0.0177163 0.0088582 0.3152232 0.7298294
Residuals 356 10.0040459 0.0281013 NA NA

From performing 3-way Anova with interactions we can observe these results:
- Rank and discipline have very significant influence on salary value. Hence, we can conclude that there is a significant difference between salaries of Professors, Assistant Professors and Associate Professors as well as between all employees in department A and department B.

  • Sex does not turn out to have significant impact on salary. Hence, the null hypothesis cannot be rejected and we do not assume any difference between males and females employees.

  • There is no significant interaction between any pair of variables: rank-discipline, rank-sex, discipline-sex, nor rank-discipline-sex. In each of these cases p-value is higher than our significance level. Hence, null hypothesis about existing significant interactions between variables are all not rejected.

model2 <- aov(data = sal22, salary ~ rank + discipline + sex)
a <- anova(model2)
3-way Anova without interactions
Df Sum Sq Mean Sq F value Pr(>F)
rank 2 9.9298550 4.9649275 177.8059375 0.0000000
discipline 1 1.5383121 1.5383121 55.0906394 0.0000000
sex 1 0.0165964 0.0165964 0.5943583 0.4412402
Residuals 363 10.1361558 0.0279233 NA NA

If we wanted to perform 3-way ANOVA without interactions, we will get only three first values from previous calculations. We only check statements whether rank, discipline or sex has any significant impact on the salary of employees - but without mixing the factors. We can again observe that significant values are reported when taking about rank and discipline, however not with gender.

Post-hoc test

As we managed to reject null hypothesis, we will now perform a post-hoc test to check which pairs of possible ranks, disciplines and genders had significant influence on reject H0.

We should check:

  • Simple two-way interaction: run two-way interaction at each level of third variable,

  • Simple simple main effect: run one-way model at each level of second variable,

  • Simple simple pairwise comparisons: run pairwise or other post-hoc comparisons if necessary.

As we had non significant 3-way interaction nor any significant 2-way interaction, we will only focus on finding main effects.

sal2grouped <- sal2 %>% unite(col = "group", rank, discipline, sex)
pair <- pairwise_t_test(data = sal2grouped, salary ~ group, 
                        p.adjust.method = "bonferroni")
Pairwise Comparizon
.y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
salary AssocProf_A_Female AssocProf_A_Male 3 16 0.429000 ns 1.0000 ns
salary AssocProf_A_Female AssocProf_B_Female 3 4 0.014600
0.9640 ns
salary AssocProf_A_Male AssocProf_B_Female 16 4 0.014300
0.9450 ns
salary AssocProf_A_Female AssocProf_B_Male 3 31 0.005870 ** 0.3870 ns
salary AssocProf_A_Male AssocProf_B_Male 16 31 0.000154 *** 0.0101
salary AssocProf_B_Female AssocProf_B_Male 4 31 0.709000 ns 1.0000 ns
salary AssocProf_A_Female AsstProf_A_Female 3 5 0.972000 ns 1.0000 ns
salary AssocProf_A_Male AsstProf_A_Female 16 5 0.307000 ns 1.0000 ns
salary AssocProf_B_Female AsstProf_A_Female 4 5 0.004880 ** 0.3220 ns
salary AssocProf_B_Male AsstProf_A_Female 31 5 0.000468 *** 0.0309

From performing pair-wise comparison between grouped three variables we can observe significant interaction between most of the pairs. Taking from first ten for example between AssocProf_A_Female and AssocProf_B_Female -> what supports the fact that null hypothesis stating that discipline has significant impact on salary score. As there are as many as 66 combinations (12 * 11 / 2) we will only show first 10 for observing the p-value.

Non Parametric Anova

Second possibility we can use is to perform a non parametric version of Anova test. As there exist only an option to do 1-way non parametric Anova - Kruskal-Wallis Anova we will use grouped variable corresponding to rank, discipline and sex at once.

sal1grouped <- sal1 %>% unite(col = "group", rank, discipline, sex)
kruskal.test(data = sal1grouped, salary ~ group)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  salary by group
## Kruskal-Wallis chi-squared = 210.93, df = 11, p-value < 2.2e-16

Using this test, we managed to again reject null hypothesis. However, the only think we know thanks to that is that the rank, discipline a person works in and his/her gender have significant impact on their salary value.

ANCOVA

In this task we also wanted to take a look at possible covariates - years since phd and years of working at the university. We are interested whether these values should be taken into account when talking about salaries.

Plots

We would like to visualize possible connections between covariates, salary and other variables values.

ggplot(data = sal22, aes(y = salary, x = yrs.since.phd, color = sex)) + geom_point() + 
  facet_wrap(~discipline~rank) + xlab("") + ylab("Salary") +
  ggtitle('Years since phd vs salary for different groups of Professors')

ggplot(data = sal22, aes(y = salary, x = yrs.service, color = sex)) + geom_point() + 
  facet_wrap(~discipline~rank) + xlab("") + ylab("Salary") +
  ggtitle('Years of service vs for different groups of Professors')

On both plots we can see that the Professors are mostly spread - in both years of service / since phd and salary. Assistant Professors earn less and also have the lowest seniority.

Years since phd

In this test we should also take a look at Ancova assumptions. Most of them were checked in previous exercise, however we will also want to take a look at few others. We are using the log(salary) to be closer to our normality and homogeneity assumptions.

Here we want to check whether the normality assumption of covariate (yrs.since.phd) is met.

ggscatter(sal22, x = "yrs.since.phd", y = "salary", 
          facet.by = c("rank", "discipline")) + stat_smooth(method = "loess")

On this plot we can also observe how years since phd and log(salary) looks like in terms of different ranks and disciplines of employees.

We also must take a look if our covariate variable is independent. To do that, we will perform a Anova test.

a <- anova(aov(salary ~ group * yrs.since.phd, data = sal2grouped))
Anova for covariate test
Df Sum Sq Mean Sq F value Pr(>F)
group 11 12.4732241 1.1339295 37.3768354 0.0000000
yrs.since.phd 1 0.0191795 0.0191795 0.6321975 0.4270817
group:yrs.since.phd 11 0.1896278 0.0172389 0.5682321 0.8544456
Residuals 355 10.7699048 0.0303378 NA NA

Here we can observe that interaction between grouped variable and our covariate is not significant (0.85). Hence, this assumption is fulfiled.

Ancova Test

ancova <- aov(salary ~ yrs.since.phd + rank * discipline * sex, data = sal22)
a <- anova(ancova)
Ancova Test when controling for years since PhD
Df Sum Sq Mean Sq F value Pr(>F)
yrs.since.phd 1 4.2386151 4.2386151 150.4140707 0.0000000
rank 2 5.7370542 2.8685271 101.7942950 0.0000000
discipline 1 1.4934390 1.4934390 52.9970837 0.0000000
sex 1 0.0160431 0.0160431 0.5693152 0.4510314
rank:discipline 2 0.0850348 0.0425174 1.5087987 0.2225903
rank:sex 2 0.0058061 0.0029030 0.1030192 0.9021366
discipline:sex 1 0.0233634 0.0233634 0.8290876 0.3631543
rank:discipline:sex 2 0.0177897 0.0088948 0.3156474 0.7295210
Residuals 355 10.0037740 0.0281796 NA NA

Having performed this Ancova test, we can observe that years since phd does have strong impact on salary value. We could have observed that on also previous plots, hence we reject the null hypothesis.

Years of service

Now we will take a look at other suspicious variable - years of service.

First, ploting it agains salary.

ggscatter(sal22, x = "yrs.service", y = "salary", 
          facet.by = c("rank", "discipline")) + stat_smooth(method = "loess")

a <- anova(aov(salary ~ group * yrs.service, data = sal2grouped))
Anova for covariate test
Df Sum Sq Mean Sq F value Pr(>F)
group 11 12.4732241 1.1339295 38.076712 0.0000000
yrs.service 1 0.0729543 0.0729543 2.449765 0.1184334
group:yrs.service 11 0.3338114 0.0303465 1.019018 0.4287228
Residuals 355 10.5719463 0.0297801 NA NA

We can observe that our variable is independent on group. Hence, we can fulfill that assumption.

ancova2 <- aov(salary ~ yrs.service + rank * discipline * sex, data = sal22)
a <- anova(ancova2)
Ancova Test when controling for years of service
Df Sum Sq Mean Sq F value Pr(>F)
yrs.service 1 2.9106921 2.9106921 103.3961891 0.0000000
rank 2 7.0785977 3.5392989 125.7261165 0.0000000
discipline 1 1.4849723 1.4849723 52.7505030 0.0000000
sex 1 0.0189667 0.0189667 0.6737516 0.4122975
rank:discipline 2 0.0845366 0.0422683 1.5014919 0.2242090
rank:sex 2 0.0070863 0.0035431 0.1258626 0.8817753
discipline:sex 1 0.0247324 0.0247324 0.8785674 0.3492316
rank:discipline:sex 2 0.0177785 0.0088893 0.3157720 0.7294303
Residuals 355 9.9935568 0.0281509 NA NA

This Ancova test also shows significant impact of years of service on salary value.

Hence, both our suspictious variables turned out to be good covariates as they should be taken into account when talking about salary of professors.

KRUSKAL-WALLIS TEST

Is there any significant difference in years since PhD (yrs.since.phd) and seniority (yrs.service) of different rank professors?

Plots & Data

In the last task, we were asked to find whether there exists a difference between years since PhD and years of service of different rank professors. We will start with ploting the data.

ggplot(data = sal1, aes(x = yrs.since.phd, y = yrs.service)) + geom_point() + 
  facet_grid(~rank) + xlab('Years since PhD') + ylab('Years of service') + 
  ggtitle('Years since PhD vs years of service for different Professors')

To check last exercise, we will create a new variable called diff. This will keep the value of difference between years since PhD - years of service. Positive value will correspond to a professor that started his/her learning adventure after completing PhD. Negative value will mean that he/she was teaching before that.

saldiff <- sal1 %>% mutate(diff = yrs.since.phd - yrs.service)

a <- saldiff %>%
  group_by(rank) %>%
  get_summary_stats(diff, type = "mean")
Summary of difference for different Professors
rank variable n mean
AsstProf diff 63 2.762
AssocProf diff 54 3.204
Prof diff 262 5.393

Summary of data can show us that there occurs to be some difference between ranks. Professors tend to have bigger mean difference between years since PhD and years of service. We must take into account however, that these people are usually older and tend to have worked longer. However, we cannot check that using our dataset.

ggplot(data = saldiff, aes(y = diff, x = rank, color = rank)) + geom_boxplot() + 
  theme(legend.position = 'none') + xlab('') + ylab('yrs since PhD - yrs of service') + 
  ggtitle('Difference between years since PhD and years of service for different Professors')

Assumptions

In this situation we would like to perform 1-way ANOVA. However, first we must take a look at assumptions.

  • samples are independent
  • samples come from populations that are normally distributed - normality check
  • homogeneity of variances in groups - check of homoscedasticity
  • not significant outliers
  1. Outliers check
outliers_data <- saldiff %>%
  group_by(sex, rank, discipline) %>%
  identify_outliers(salary)

x <- outliers_data %>% filter(is.extreme == 'TRUE')
x <- x[, 4]
x <- as.numeric(unlist(x))

saldiff2 <- saldiff[-which(saldiff$v %in% x),]
Outliers Data
rank discipline sex v yrs.since.phd yrs.service salary diff is.outlier is.extreme
AsstProf A Male 235 8 3 69700 5 TRUE FALSE
AsstProf A Male 241 5 3 69200 2 TRUE FALSE
AsstProf A Male 360 11 4 78785 7 TRUE FALSE
AssocProf A Male 139 10 7 73877 3 TRUE TRUE
AssocProf A Male 258 30 23 74000 7 TRUE TRUE
AssocProf A Male 261 41 33 88600 8 TRUE TRUE
AssocProf A Male 285 8 6 88650 2 TRUE TRUE
AssocProf A Male 371 13 8 78182 5 TRUE FALSE
AssocProf A Male 383 8 5 86895 3 TRUE FALSE
Prof A Male 390 33 18 186023 15 TRUE FALSE

As we found some significant outliers, we will delete them from our dataset.

  1. Normality check
a <- saldiff2 %>%
  group_by(rank) %>%
  shapiro_test(diff)
Shapiro Wilk Test
rank variable statistic p
AsstProf diff 0.9077412 0.0001785
AssocProf diff 0.9089302 0.0009604
Prof diff 0.9064980 0.0000000
qqnorm(saldiff2$diff, main="QQ plot", pch=19)
qqline(saldiff2$diff)

As we can observe, we are not dealing with normal data. We will check homoscedasticity, however even then we will use non - parametric test.

  1. Homoscedasticity
a <- leveneTest(data = saldiff2, diff ~ rank)
Levene’s Test for homoscedasticity of variance
Df F value Pr(>F)
group 2 17.28444 1e-07
372 NA NA

Our data do not contain homogeneous variance. Hence, we must perform non parametric test - Kruskal-Wallis.

Kruskal-Wallis Test

kruskal.test(data = saldiff2, diff ~ rank)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  diff by rank
## Kruskal-Wallis chi-squared = 8.4193, df = 2, p-value = 0.01485

We obtained a very nice value of p 0.01 -> meaning with our significance level 0.05 we can reject null hypothesis. Hence, there is a difference in differences between years since PhD and years of service for different rank professors.