Salaries data analysis
The data from the carData package consists of 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members.
I) Can years since doctorate (yrs.since.phd) and length of service (yrs.service) serve as significant covariates?
Descriptive Statistics
Let’s have a general look at the structure and summary of the data:
| rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|
| AsstProf : 67 | A:181 | Min. : 1.00 | Min. : 0.00 | Female: 39 | Min. : 57800 |
| AssocProf: 64 | B:216 | 1st Qu.:12.00 | 1st Qu.: 7.00 | Male :358 | 1st Qu.: 91000 |
| Prof :266 | | Median :21.00 | Median :16.00 | | Median :107300 |
| | | Mean :22.31 | Mean :17.61 | | Mean :113706 |
| | | 3rd Qu.:32.00 | 3rd Qu.:27.00 | | 3rd Qu.:134185 |
| | | Max. :56.00 | Max. :60.00 | | Max. :231545 |
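The summary above can be reproduced with a few lines of base R. This is a minimal sketch that assumes the carData package is installed; the object name `n_obs` is only illustrative.

```r
# Load the Salaries data shipped with carData (assumes the package is installed).
data("Salaries", package = "carData")

str(Salaries)      # variable types and factor levels
summary(Salaries)  # level counts for factors, quartiles for numeric columns

n_obs <- nrow(Salaries)  # number of faculty members in the data
n_obs
```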
The boxplot also shows that, for both genders, the median salary is generally higher for professors working in Applied departments than for those working in Theoretical departments.
| sex | mean_salary | sd_salary | mean_yrs_phd | sd_phd | mean_yrs_service | sd_service |
|---|---|---|---|---|---|---|
| Female | 101002.4 | 25952.13 | 16.51282 | 9.784176 | 11.56410 | 8.813252 |
| Male | 115090.4 | 30436.93 | 22.94693 | 13.036470 | 18.27374 | 13.226234 |
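The group means and standard deviations in the table can be recomputed with `aggregate()`. This is a sketch in base R, not necessarily the code used to produce the table; the names `num_vars`, `means_by_sex` and `sds_by_sex` are illustrative.

```r
data("Salaries", package = "carData")

# Numeric variables summarised in the table above.
num_vars <- c("salary", "yrs.since.phd", "yrs.service")

# Mean and standard deviation of each variable, split by sex.
means_by_sex <- aggregate(Salaries[num_vars], by = list(sex = Salaries$sex), FUN = mean)
sds_by_sex   <- aggregate(Salaries[num_vars], by = list(sex = Salaries$sex), FUN = sd)

means_by_sex
sds_by_sex
```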
Since salary data are being examined, it is worth inspecting the distribution of salary. The salary distribution is positively skewed.
The following set of plots shows the distribution of salary for each gender. There are many more observations for men, and their distribution is closer to normal.
To examine the impact of the two explanatory variables on the outcome variable, the relationship between these two variables can also be checked within different groups of professors.
Next, the correlations between the two explanatory variables and the outcome variable salary are investigated. There is a strong correlation between yrs.since.phd and yrs.service. Pearson’s correlation coefficients by gender:
For each gender, yrs.since.phd and yrs.service are strongly correlated. Both variables also appear to be moderately correlated with salary.
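The per-gender correlation matrices can be obtained by splitting the data and calling `cor()` on each part. This is a minimal base R sketch; `vars` and `cor_by_sex` are illustrative names.

```r
data("Salaries", package = "carData")

vars <- c("yrs.since.phd", "yrs.service", "salary")

# One Pearson correlation matrix per sex.
cor_by_sex <- lapply(split(Salaries[vars], Salaries$sex), cor)

cor_by_sex$Female
cor_by_sex$Male
```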
Assumptions
When doing ANCOVA, some basic assumptions should be checked.
Necessary transformation
A logarithmic transformation is applied to the salary column because its distribution is strongly right-skewed. The distributions after transformation, and the Shapiro-Wilk test by gender, look as follows.
| sex | variable | statistic | p |
|---|---|---|---|
| Female | salary_log | 0.9562417 | 0.13341 |
| Male | salary_log | 0.9899906 | 0.0152 |
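After the transformation, visual inspection suggests that log salary is approximately normal in both groups. The Shapiro-Wilk test does not reject normality for women (p = 0.133) but does reject it for men at the 0.05 level (p = 0.015). With 358 male observations the test is sensitive to small deviations, and the statistic W = 0.99 is close to 1, so approximate normality is assumed. Further investigation is conducted using the log of salary.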
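The transformation and the per-gender Shapiro-Wilk test can be sketched in base R as follows. The natural logarithm is an assumption here (the Shapiro-Wilk statistic does not depend on the base), and `sw_by_sex` and `sw_p` are illustrative names.

```r
data("Salaries", package = "carData")

# Natural log of salary; the base of the logarithm is an assumption.
Salaries$salary_log <- log(Salaries$salary)

# Shapiro-Wilk test of log salary within each sex.
sw_by_sex <- lapply(split(Salaries$salary_log, Salaries$sex), shapiro.test)
sw_p <- sapply(sw_by_sex, function(res) res$p.value)
sw_p
```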
1) Normality of residuals
| variable | statistic | p.value |
|---|---|---|
| model_phd.metrics$.resid | 0.9914749 | 0.0220476 |
| variable | statistic | p.value |
|---|---|---|
| model_serv.metrics$.resid | 0.9954028 | 0.2931426 |
The Shapiro-Wilk test indicates that the assumption of normality of residuals is violated only for the model with the proposed covariate yrs.since.phd (p = 0.022 < 0.05). The model with yrs.service meets the assumption, since p = 0.29 > 0.05.
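A sketch of the residual check in base R is shown below. The model form (covariate plus sex, no interaction, log salary as response) is an assumption inferred from the object names in the tables, not something the text states explicitly.

```r
data("Salaries", package = "carData")
Salaries$salary_log <- log(Salaries$salary)

# Assumed model form: one covariate plus the grouping factor, no interaction.
model_phd  <- lm(salary_log ~ yrs.since.phd + sex, data = Salaries)
model_serv <- lm(salary_log ~ yrs.service + sex, data = Salaries)

# Shapiro-Wilk test on the residuals of each model.
sw_phd  <- shapiro.test(residuals(model_phd))
sw_serv <- shapiro.test(residuals(model_serv))
c(phd = sw_phd$p.value, service = sw_serv$p.value)
```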
2) Homogeneity of variance
It should be checked whether the variance of salary is equal for each group.
| df1 | df2 | statistic | p |
|---|---|---|---|
| 1 | 395 | 0.0078427 | 0.9294775 |
p-value = 0.93 > 0.05 so the assumption of the homogeneity of variance is met.
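Levene's test can be written out by hand as a one-way ANOVA on absolute deviations from the group medians. The sketch below applies it to log salary, which is an assumption: the text does not say which response was passed to the test, so the statistic need not match the table exactly.

```r
data("Salaries", package = "carData")
Salaries$salary_log <- log(Salaries$salary)  # assumed response for the test

# Absolute deviation of each observation from its group median.
group_median <- ave(Salaries$salary_log, Salaries$sex, FUN = median)
abs_dev <- abs(Salaries$salary_log - group_median)

# A one-way ANOVA on these deviations is the median-centred Levene test.
levene_tab <- anova(lm(abs_dev ~ Salaries$sex))
levene_tab
```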
3) Linearity of the regression
Based on visual inspection of both scatterplots, the relationships between log salary and both years since PhD and years of service are approximately linear.
4) Homogeneity of regression coefficients
This assumption requires that there is no significant interaction between the covariate and the grouping variable. It can be evaluated as follows:
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 sex 1 393 3.563 6.00e-02 0.009
## 2 yrs.since.phd 1 393 81.453 8.09e-18 * 0.172
## 3 sex:yrs.since.phd 1 393 3.994 4.60e-02 * 0.010
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 sex 1 393 4.220 4.10e-02 * 0.011
## 2 yrs.service 1 393 47.192 2.53e-11 * 0.107
## 3 sex:yrs.service 1 393 4.079 4.40e-02 * 0.010
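Strictly, this assumption is violated at the 0.05 level, since the interaction term is statistically significant in both models (p = 0.046 and p = 0.044). Both p-values are close to the threshold and the effect sizes are small (ges = 0.010), so the analysis proceeds as if the regression slopes were approximately homogeneous, with this caveat in mind.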
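The interaction check can be sketched in base R by fitting a model with the `sex * covariate` term. Because the interaction enters last, its F test in the sequential `anova()` table coincides with the type II test. Using log salary as the response is an assumption.

```r
data("Salaries", package = "carData")
Salaries$salary_log <- log(Salaries$salary)  # assumed response

# Model with main effects and the sex-by-covariate interaction.
fit_int <- lm(salary_log ~ sex * yrs.since.phd, data = Salaries)

# The last row before Residuals tests the interaction (slope homogeneity).
int_tab <- anova(fit_int)
int_tab["sex:yrs.since.phd", ]
```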
5) Independence of covariate and treatment
Here it should be verified that sex is not related to yrs.since.phd or to yrs.service.
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 1456 1455.9 8.942 0.00296 **
## Residuals 395 64310 162.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 0.003, which is less than 0.05, so the covariate years since PhD and gender are not independent of each other.
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 1583 1583.3 9.562 0.00213 **
## Residuals 395 65403 165.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 0.002, which is less than 0.05, so the covariate years of service and gender are not independent of each other. The assumption of independence of covariate and treatment is violated.
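Both checks are one-way ANOVAs of the covariate on sex and can be reproduced in base R. The object names below are illustrative.

```r
data("Salaries", package = "carData")

# One-way ANOVA of each covariate on sex.
aov_phd  <- summary(aov(yrs.since.phd ~ sex, data = Salaries))[[1]]
aov_serv <- summary(aov(yrs.service ~ sex, data = Salaries))[[1]]

# F statistic and p-value for the sex effect (first row of each table).
f_phd  <- aov_phd[["F value"]][1]
f_serv <- aov_serv[["F value"]][1]
p_phd  <- aov_phd[["Pr(>F)"]][1]
p_serv <- aov_serv[["Pr(>F)"]][1]
c(f_phd = f_phd, p_phd = p_phd, f_serv = f_serv, p_serv = p_serv)
```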
ANCOVA
Years since PhD:
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 yrs.since.phd 1 394 80.839 1.04e-17 * 0.170
## 2 sex 1 394 3.536 6.10e-02 0.009
After adjustment for yrs.since.phd, there was no statistically significant difference in log salary between genders, F(1, 394) = 3.536, p = 0.061.
Years of service:
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 yrs.service 1 394 46.826 2.98e-11 * 0.106
## 2 sex 1 394 4.187 4.10e-02 * 0.011
After adjustment for yrs.service, there was a statistically significant difference in log salary between genders, F(1, 394) = 4.187, p = 0.041. The covariate itself is also significant (F(1, 394) = 46.826, p < 0.001), so yrs.service can be regarded as a significant covariate.
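For a model with main effects only, the type II tests can be obtained in base R with `drop1()`, which tests each term after adjusting for the other. This is a sketch that assumes log salary as the response; it is a swapped-in base R route, not necessarily the function used for the tables above.

```r
data("Salaries", package = "carData")
Salaries$salary_log <- log(Salaries$salary)  # assumed response

# ANCOVA model: covariate plus grouping factor, no interaction.
fit_serv <- lm(salary_log ~ yrs.service + sex, data = Salaries)

# Each row tests one term adjusted for the other (type II in this model).
ancova_serv <- drop1(fit_serv, test = "F")
ancova_serv
```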
II) Is there any significant difference in years since PhD (yrs.since.phd) and seniority (yrs.service) between professors of different rank?
A MANOVA should be conducted to assess the impact of the professor’s rank on years since PhD and years of service.
H0: The group mean vectors are the same for all groups, i.e. they do not differ significantly.
H1: At least one of the group mean vectors differs from the rest.
Assumptions
1) Univariate normality
The two dependent variables, yrs.since.phd and yrs.service, are checked for normality within each level of the explanatory variable rank.
| rank | variable | statistic | p |
|---|---|---|---|
| AsstProf | yrs.service | 0.9335151 | 0.00144 |
| AsstProf | yrs.since.phd | 0.9361980 | 0.00192 |
| AssocProf | yrs.service | 0.6913898 | 0 |
| AssocProf | yrs.since.phd | 0.7268879 | 0 |
| Prof | yrs.service | 0.9784598 | 0.00047 |
| Prof | yrs.since.phd | 0.9708954 | 0.00003 |
Based on visual inspection of histograms and Q-Q plots together with the Shapiro-Wilk test, it can be concluded that the distributions of yrs.since.phd and yrs.service are not normal.
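The per-rank Shapiro-Wilk p-values can be reproduced in base R with `aggregate()`. This is a minimal sketch; `sw_rank` is an illustrative name.

```r
data("Salaries", package = "carData")

vars <- c("yrs.service", "yrs.since.phd")

# Shapiro-Wilk p-value of each variable within each rank.
sw_rank <- aggregate(Salaries[vars], by = list(rank = Salaries$rank),
                     FUN = function(x) shapiro.test(x)$p.value)
sw_rank
```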
2) Multivariate normality
| statistic | p.value |
|---|---|
| 0.8772826 | 0 |
The mshapiro test is significant (p ≈ 0), so multivariate normality cannot be assumed.
3) Moderate multicollinearity
Ideally the correlation between the outcome variables should be moderate, not too high. A correlation above 0.9 is an indication of multicollinearity, which is problematic for MANOVA.
On the other hand, if the correlation is too low, running a separate one-way ANOVA for each outcome variable should be considered.
Source: datanovia.com
| var1 | var2 | cor | statistic | p | conf.low | conf.high | method |
|---|---|---|---|---|---|---|---|
| yrs.since.phd | yrs.service | 0.91 | 43.52407 | 0 | 0.8909977 | 0.9252353 | Pearson |
Pearson’s correlation coefficient between yrs.since.phd and yrs.service for the whole sample is 0.91, which is right at the critical threshold.
However, moderate multicollinearity is assumed, since in two out of three groups (including the largest one) the coefficient is below 0.9.
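The overall correlation test and the per-rank coefficients can be sketched in base R as follows. The names `ct` and `cor_by_rank` are illustrative.

```r
data("Salaries", package = "carData")

# Pearson correlation test on the whole sample.
ct <- cor.test(Salaries$yrs.since.phd, Salaries$yrs.service, method = "pearson")
ct$estimate   # correlation coefficient
ct$conf.int   # 95% confidence interval

# Correlation within each rank, for comparison with the 0.9 threshold.
cor_by_rank <- sapply(split(Salaries, Salaries$rank),
                      function(d) cor(d$yrs.since.phd, d$yrs.service))
cor_by_rank
```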
4) Linearity
[Figures: one linearity plot per rank (Assistant Professor, Associate Professor, Professor)]
Assumption of linearity in each group is met.
5) Homogeneity of covariances
To assess the homogeneity of covariances, Box’s M-test is conducted.
| statistic | p.value | parameter | method |
|---|---|---|---|
| 265.3564 | 0 | 6 | Box’s M-test for Homogeneity of Covariance Matrices |
The test is statistically significant (p < 0.05), so the data have violated the assumption of homogeneity of variance-covariance matrices.
The analysis will be continued, but using Pillai’s multivariate statistic instead of Wilks’ statistic.
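For reference, Box's M statistic with its chi-square approximation can be written out by hand in base R. This is a sketch of the standard formula, offered as an illustration rather than as the code behind the table; all object names are illustrative.

```r
data("Salaries", package = "carData")

Y <- as.matrix(Salaries[, c("yrs.since.phd", "yrs.service")])
g <- Salaries$rank
p <- ncol(Y)      # number of outcome variables
k <- nlevels(g)   # number of groups

dfs  <- as.vector(table(g)) - 1  # n_i - 1 for each group
covs <- lapply(levels(g), function(l) cov(Y[g == l, , drop = FALSE]))

# Pooled covariance matrix, weighted by group degrees of freedom.
pooled <- Reduce(`+`, Map(`*`, covs, dfs)) / sum(dfs)

# Box's M and the correction factor for the chi-square approximation.
M <- sum(dfs) * log(det(pooled)) - sum(dfs * log(sapply(covs, det)))
C <- (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (k - 1)) *
  (sum(1 / dfs) - 1 / sum(dfs))

chi2    <- (1 - C) * M
df_chi  <- p * (p + 1) * (k - 1) / 2
p_value <- pchisq(chi2, df_chi, lower.tail = FALSE)
c(statistic = chi2, df = df_chi, p.value = p_value)
```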
6) Homogeneity of variances
## # A tibble: 2 x 5
## variable df1 df2 statistic p
## <chr> <int> <int> <dbl> <dbl>
## 1 yrs.service 2 394 39.5 2.37e-16
## 2 yrs.since.phd 2 394 35.2 8.91e-15
Levene’s test is significant for both variables (p < 0.05), so homogeneity of variances cannot be assumed.
The analysis will be continued without transforming the outcome variables, but a lower significance level (alpha level) will have to be accepted for the MANOVA result.
MANOVA
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## rank 2 0.49855 65.414 4 788 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the Pillai test statistic and a lowered significance level because of the assumption violations, the conclusion is as follows.
There was a statistically significant difference between professor ranks on the combined dependent variables (years since PhD and years of service), F(4, 788) = 65.414, p < 0.0001.
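With a single factor, the same Pillai test can be obtained from base R's `manova()`. This is a sketch using base R rather than the type II routine shown in the output above; for a one-factor model the two coincide.

```r
data("Salaries", package = "carData")

# One-way MANOVA of the two outcomes on rank.
fit_manova <- manova(cbind(yrs.since.phd, yrs.service) ~ rank, data = Salaries)

# Pillai's trace is more robust to unequal covariance matrices than Wilks' lambda.
pillai <- summary(fit_manova, test = "Pillai")
pillai$stats
```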
Post-hoc tests
To find out where the differences occur, post-hoc tests are conducted.
One-way ANOVA
One-way ANOVA is conducted for each outcome variable separately.
## # A tibble: 2 x 8
## variable Effect DFn DFd F p `p<.05` ges
## * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 yrs.service rank 2 394 116. 2.61e-40 * 0.37
## 2 yrs.since.phd rank 2 394 191. 9.35e-59 * 0.492
There was a statistically significant difference in yrs.since.phd between professors’ ranks (F(2, 394) = 191.177, p < 0.0001).
There was a statistically significant difference in yrs.service between professors’ ranks (F(2, 394) = 115.896, p < 0.0001).
As we have two dependent variables, we need to apply the Bonferroni multiple testing correction by lowering the level at which we declare statistical significance. This is done by dividing the classic alpha level (0.05) by the number of tests (or dependent variables, here 2), which leads to a significance criterion of p < 0.025 rather than p < 0.05. Both results remain significant under this criterion.
Source: datanovia.com
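The two follow-up ANOVAs and the Bonferroni-adjusted threshold can be sketched in base R as follows. The helper `aov_tab` and the other names are hypothetical, introduced only for this illustration.

```r
data("Salaries", package = "carData")

# Bonferroni-adjusted significance level for two dependent variables.
alpha_adj <- 0.05 / 2

# Hypothetical helper: one-way ANOVA table of a given outcome on rank.
aov_tab <- function(v) summary(aov(Salaries[[v]] ~ Salaries$rank))[[1]]

tab_phd  <- aov_tab("yrs.since.phd")
tab_serv <- aov_tab("yrs.service")

f_phd  <- tab_phd[["F value"]][1]
f_serv <- tab_serv[["F value"]][1]

# Compare each p-value with the adjusted threshold.
sig_phd  <- tab_phd[["Pr(>F)"]][1] < alpha_adj
sig_serv <- tab_serv[["Pr(>F)"]][1] < alpha_adj
c(f_phd = f_phd, f_serv = f_serv)
```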
Multiple pairwise comparisons
Pairwise comparisons follow up the ANOVA to assess which pairs of ranks differ on each outcome variable.
| variables | .y. | group1 | group2 | p.adj | p.adj.signif |
|---|---|---|---|---|---|
| yrs.service | value | AsstProf | AssocProf | 0 | **** |
| yrs.service | value | AsstProf | Prof | 0 | **** |
| yrs.service | value | AssocProf | Prof | 0 | **** |
| yrs.since.phd | value | AsstProf | AssocProf | 0 | **** |
| yrs.since.phd | value | AsstProf | Prof | 0 | **** |
| yrs.since.phd | value | AssocProf | Prof | 0 | **** |
| variable | group1 | group2 | p.adj.signif |
|---|---|---|---|
| yrs.service | AsstProf | AssocProf | **** |
| yrs.service | AsstProf | Prof | **** |
| yrs.service | AssocProf | Prof | **** |
| yrs.since.phd | AsstProf | AssocProf | **** |
| yrs.since.phd | AsstProf | Prof | **** |
| yrs.since.phd | AssocProf | Prof | **** |
Both the Games-Howell post-hoc test and the pairwise t-test with unpooled standard deviations (unequal variances) show significant differences for every pair of ranks.
All pairwise comparisons were significant for each of the outcome variables. After the post-hoc tests it can be stated that there are significant differences in years since PhD and seniority between professors of different rank.
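The unpooled pairwise t-tests can be sketched with base R's `pairwise.t.test()`. The Bonferroni adjustment is an assumption, since the text does not name the adjustment method; the Games-Howell test is not part of base R and is not shown here.

```r
data("Salaries", package = "carData")

# Welch-type pairwise t-tests (pool.sd = FALSE allows unequal variances).
# The Bonferroni adjustment is an assumption, not stated in the text.
pw_phd <- pairwise.t.test(Salaries$yrs.since.phd, Salaries$rank,
                          pool.sd = FALSE, p.adjust.method = "bonferroni")
pw_serv <- pairwise.t.test(Salaries$yrs.service, Salaries$rank,
                           pool.sd = FALSE, p.adjust.method = "bonferroni")

pw_phd$p.value   # lower-triangular matrix of adjusted p-values
pw_serv$p.value
```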