library(carData)
data("Salaries")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
Salaries %>%
  group_by(sex, rank, discipline) %>%
  get_summary_stats(salary, type = "common")
## # A tibble: 12 × 13
## rank disci…¹ sex varia…² n min max median iqr mean sd
## <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AsstPr… A Fema… salary 6 63100 78500 73000 4000 7.29e4 5463.
## 2 AsstPr… B Fema… salary 5 74692 97032 80225 15000 8.42e4 9792.
## 3 AssocP… A Fema… salary 4 62884 77500 74065 4802. 7.21e4 6403.
## 4 AssocP… B Fema… salary 6 71065 109650 103872 758. 9.94e4 14086.
## 5 Prof A Fema… salary 8 90450 137000 109800 15226. 1.10e5 15095.
## 6 Prof B Fema… salary 10 105450 161101 128256 20214. 1.32e5 17504.
## 7 AsstPr… A Male salary 18 63900 85000 74000 3086 7.43e4 4580.
## 8 AsstPr… B Male salary 38 68404 95079 85408 10906. 8.46e4 6900.
## 9 AssocP… A Male salary 22 70000 108413 82350 6835 8.50e4 10612.
## 10 AssocP… B Male salary 32 83900 126431 100730 10231 1.02e5 9608.
## 11 Prof A Male salary 123 57800 205500 113278 34290 1.21e5 28505.
## 12 Prof B Male salary 125 67559 231545 132825 35876 1.34e5 26514.
## # … with 2 more variables: se <dbl>, ci <dbl>, and abbreviated variable names
## # ¹discipline, ²variable
We print summary statistics per group to get an overall grasp of the dataset.
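The cell sizes are quite unbalanced (from 4 female associate professors in discipline A up to 125 male professors in discipline B), which matters for the tests below. A minimal sketch of a group-size check, assuming dplyr is available (the pipes above already rely on it):
library(dplyr)
# Number of observations per sex/rank/discipline cell
Salaries %>%
  count(sex, rank, discipline)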
library(ggpubr)
## Loading required package: ggplot2
ggboxplot(Salaries, x = "discipline", y = "salary",
          color = "rank", palette = "aaas", facet.by = "sex") +
  labs(title = "Salary in relation to rank, discipline and sex",
       x = "Discipline",
       y = "Salary") +
  theme_light()
## Outliers
outliers <- Salaries %>%
  group_by(sex, rank, discipline) %>%
  identify_outliers(salary)
outliers
## # A tibble: 18 × 8
## rank discipline sex yrs.since.phd yrs.service salary is.out…¹ is.ex…²
## <fct> <fct> <fct> <int> <int> <int> <lgl> <lgl>
## 1 AsstProf A Female 7 6 63100 TRUE FALSE
## 2 AssocProf A Female 25 22 62884 TRUE FALSE
## 3 AssocProf B Female 14 7 109650 TRUE TRUE
## 4 AssocProf B Female 12 9 71065 TRUE TRUE
## 5 AsstProf A Male 3 1 63900 TRUE FALSE
## 6 AsstProf A Male 2 0 85000 TRUE TRUE
## 7 AsstProf A Male 8 4 81035 TRUE FALSE
## 8 AssocProf A Male 14 8 100102 TRUE FALSE
## 9 AssocProf A Male 9 7 70000 TRUE FALSE
## 10 AssocProf A Male 11 1 104800 TRUE FALSE
## 11 AssocProf A Male 45 39 70700 TRUE FALSE
## 12 AssocProf A Male 10 1 108413 TRUE FALSE
## 13 AssocProf A Male 11 8 104121 TRUE FALSE
## 14 AssocProf B Male 13 11 126431 TRUE FALSE
## 15 Prof A Male 29 7 204000 TRUE FALSE
## 16 Prof A Male 42 18 194800 TRUE FALSE
## 17 Prof A Male 43 43 205500 TRUE FALSE
## 18 Prof B Male 38 38 231545 TRUE FALSE
## # … with abbreviated variable names ¹is.outlier, ²is.extreme
We test for outliers in the salary variable within each group. There are 18 outliers among the 397 observations; three of them are flagged as extreme, and the largest outlying salaries reach roughly twice the overall mean.
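To look only at the most problematic observations, we can filter the outlier table on the is.extreme column returned by identify_outliers (a small sketch):
# Keep only the rows flagged as extreme outliers
outliers %>%
  filter(is.extreme)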
Salaries %>%
  group_by(discipline, rank, sex) %>%
  shapiro_test(salary)
## # A tibble: 12 × 6
## rank discipline sex variable statistic p
## <fct> <fct> <fct> <chr> <dbl> <dbl>
## 1 AsstProf A Female salary 0.870 0.226
## 2 AsstProf A Male salary 0.941 0.300
## 3 AssocProf A Female salary 0.863 0.269
## 4 AssocProf A Male salary 0.878 0.0113
## 5 Prof A Female salary 0.934 0.549
## 6 Prof A Male salary 0.952 0.000259
## 7 AsstProf B Female salary 0.889 0.354
## 8 AsstProf B Male salary 0.941 0.0458
## 9 AssocProf B Female salary 0.635 0.00117
## 10 AssocProf B Male salary 0.967 0.416
## 11 Prof B Female salary 0.974 0.923
## 12 Prof B Male salary 0.978 0.0435
Not all groups are normally distributed: several Shapiro-Wilk p-values fall below 0.05. We therefore remove the previously identified outliers to clean the dataset.
# Replace outlying salaries with NA, then drop those rows
Salaries$salary[Salaries$salary %in% outliers$salary] <- NA
Salaries <- na.omit(Salaries)
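As a visual complement to the Shapiro-Wilk tests, we can draw QQ plots of salary after the cleaning step. A sketch using ggpubr's ggqqplot; faceting by rank only is our simplification, not part of the original analysis:
# QQ plots of salary by rank, coloured by sex
ggqqplot(Salaries, "salary", facet.by = "rank", color = "sex")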
Salaries %>%
  levene_test(salary ~ sex * discipline * rank)
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 11 366 10.8 2.39e-17
model <- lm(salary ~ sex * discipline * rank, data = Salaries)
plot(model, 1)
The residuals-versus-fitted plot of the model shows how the residuals behave across fitted values. The Levene test gives a very low p-value, so the group variances differ and we cannot assume homogeneity of variance.
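Because homogeneity of variance is violated, one option for a single factor is the Welch ANOVA, which does not assume equal variances. A sketch with rstatix::welch_anova_test; this is an alternative, not what the analysis below does:
# Welch one-way ANOVA of salary by rank (no equal-variance assumption)
Salaries %>%
  welch_anova_test(salary ~ rank)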
results <- aov(salary ~ sex * rank * discipline, data = Salaries)
anova(results)
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 3.7957e+09 3.7957e+09 8.4300 0.003915 **
## rank 2 1.2059e+11 6.0294e+10 133.9110 < 2.2e-16 ***
## discipline 1 2.0151e+10 2.0151e+10 44.7553 8.383e-11 ***
## sex:rank 2 1.9832e+08 9.9160e+07 0.2202 0.802440
## sex:discipline 1 2.2091e+08 2.2091e+08 0.4906 0.484089
## rank:discipline 2 6.2222e+08 3.1111e+08 0.6910 0.501745
## sex:rank:discipline 2 1.5337e+08 7.6686e+07 0.1703 0.843464
## Residuals 366 1.6479e+11 4.5026e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All three main effects (sex, rank, discipline) are significant, but none of the interactions between them is.
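To see which rank levels differ from each other, a post-hoc comparison can follow the ANOVA. A sketch using base R's TukeyHSD on the fitted aov object; keep the unequal variances noted above in mind when reading it:
# Pairwise comparisons of ranks with Tukey's HSD
TukeyHSD(results, which = "rank")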
anova(aov(data = Salaries, salary ~ yrs.since.phd + rank * discipline * sex))
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.since.phd 1 5.0299e+10 5.0299e+10 111.4067 < 2.2e-16 ***
## rank 2 7.4563e+10 3.7282e+10 82.5754 < 2.2e-16 ***
## discipline 1 1.9486e+10 1.9486e+10 43.1600 1.741e-10 ***
## sex 1 1.8904e+08 1.8904e+08 0.4187 0.5180
## rank:discipline 2 6.3424e+08 3.1712e+08 0.7024 0.4961
## rank:sex 2 1.5269e+08 7.6344e+07 0.1691 0.8445
## discipline:sex 1 2.5477e+08 2.5477e+08 0.5643 0.4530
## rank:discipline:sex 2 1.5257e+08 7.6284e+07 0.1690 0.8446
## Residuals 365 1.6479e+11 4.5149e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We test whether years since PhD is a significant covariate in an ANCOVA. It clearly affects salary, and once it is included, sex is no longer significant.
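A standard ANCOVA assumption is that the covariate has the same slope in every group. A quick check we sketch here (not part of the original analysis) is to test the covariate-by-factor interaction, which should not be significant:
# Homogeneity of regression slopes: the yrs.since.phd:rank interaction should be non-significant
anova(aov(salary ~ yrs.since.phd * rank, data = Salaries))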
anova(aov(data = Salaries, salary ~ yrs.service + rank * discipline * sex))
## Analysis of Variance Table
##
## Response: salary
## Df Sum Sq Mean Sq F value Pr(>F)
## yrs.service 1 3.3335e+10 3.3335e+10 74.0038 2.341e-16 ***
## rank 2 9.2014e+10 4.6007e+10 102.1359 < 2.2e-16 ***
## discipline 1 1.9289e+10 1.9289e+10 42.8208 2.033e-10 ***
## sex 1 2.4149e+08 2.4149e+08 0.5361 0.4645
## rank:discipline 2 6.0649e+08 3.0325e+08 0.6732 0.5107
## rank:sex 2 1.8637e+08 9.3187e+07 0.2069 0.8132
## discipline:sex 1 2.8005e+08 2.8005e+08 0.6217 0.4309
## rank:discipline:sex 2 1.5859e+08 7.9294e+07 0.1760 0.8387
## Residuals 365 1.6441e+11 4.5045e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Years of service is a significant covariate as well.
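Before using both covariates together it is worth checking how strongly they are related, since highly correlated covariates carry overlapping information. A small sketch with base R:
# Correlation between the two candidate covariates
cor(Salaries$yrs.since.phd, Salaries$yrs.service)
cor.test(Salaries$yrs.since.phd, Salaries$yrs.service)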
library(car)  # Manova() comes from the car package
model <- lm(cbind(yrs.since.phd, yrs.service) ~ salary, data = Salaries)
Manova(model, test.statistic = "Hotelling")
##
## Type II MANOVA Tests: Hotelling-Lawley test statistic
## Df test stat approx F num Df den Df Pr(>F)
## salary 1 0.20613 38.65 2 375 5.474e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is much lower than 0.05: salary is significantly related to years of service and years since PhD considered jointly.
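As a follow-up, the multivariate model can be checked with a different test statistic and inspected response by response. A sketch using car and base R:
# Robustness check with a different multivariate test statistic
Manova(model, test.statistic = "Pillai")
# Univariate follow-up: each response regressed on salary separately
summary(model)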